I don't know how relevant it is, but Mahout's Decision Forests already has some form of Instance. It was inspired by Weka (because the original Random Forests implementation was written there) but was adapted and simplified for Mahout.
The code still uses Vector(s) to store that data via an Instance class (essentially a Vector with a "label" and a unique "id"). It can store both continuous and categorical attributes (categorical values are converted to integers). A special "Dataset" class identifies the nature of each attribute and converts categorical attributes to their integer counterparts. This class carries roughly the same information as an ARFF header; in fact, if you remove the ARFF header, the remaining data is similar to CSV and Mahout DF loads it just fine. The Dataset can also be used to "ignore" some of the attributes when loading the data.

One problem I found with this representation is where to store the Dataset information. In my case I wrote a "Describe" tool that goes through the whole dataset (or a subset that contains all the categorical values) and generates a .dataset file, which is loaded along with the data.

Keep in mind, though, that those classes were written specifically to be used by Decision Forests.

---------- Forwarded message ----------
From: Lance Norskog <[email protected]>
Date: Thu, Aug 25, 2011 at 5:29 AM
Subject: Re: discussion of input conversions
To: [email protected]

I would like a comment Writable at the beginning and/or end of a SequenceFile. They could be just a StringWritable storing JSON. The one at the beginning would be the classic metadata header; the one at the end might hold stats about the vectors the file stores, built during writing.

For classification outputs in particular, I would like to know the tuning knobs used, the raw classification outputs, and the confusion matrix, especially if I can ask for any or all of them to be saved. This would still be the same SequenceFile format as before, just with metadata objects.
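To make the Instance/Dataset representation described earlier in this message concrete, here is a minimal plain-Java sketch. This is not the actual Mahout API; the class and method names are made up for illustration. The idea is just that a dataset descriptor assigns each categorical value an integer code (the kind of pass a "Describe" tool would make), so every instance can be stored as a plain double vector plus a label and an id.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the Instance/Dataset idea; NOT the Mahout classes.
public class DatasetSketch {

  // One value-to-integer dictionary per categorical attribute index.
  private final Map<Integer, Map<String, Integer>> dictionaries = new LinkedHashMap<>();

  // Returns the integer code for a categorical value, assigning a new
  // code the first time the value is seen.
  public int encode(int attribute, String value) {
    Map<String, Integer> dict =
        dictionaries.computeIfAbsent(attribute, a -> new LinkedHashMap<>());
    return dict.computeIfAbsent(value, v -> dict.size());
  }

  // An instance is just a vector of doubles plus a label and a unique id.
  public static final class Instance {
    final int id;
    final String label;
    final double[] values;
    Instance(int id, String label, double[] values) {
      this.id = id;
      this.label = label;
      this.values = values;
    }
  }

  public static void main(String[] args) {
    DatasetSketch dataset = new DatasetSketch();
    // Attribute 0 is categorical ("red"/"blue"), attribute 1 is continuous.
    Instance a = new Instance(0, "yes", new double[] {dataset.encode(0, "red"), 1.5});
    Instance b = new Instance(1, "no",  new double[] {dataset.encode(0, "blue"), 2.0});
    Instance c = new Instance(2, "yes", new double[] {dataset.encode(0, "red"), 0.7});
    // "red" always maps to the same code; distinct values get distinct codes.
    System.out.println("red -> " + (int) a.values[0] + ", blue -> " + (int) b.values[0]);
    System.out.println("same code for repeated value: " + (a.values[0] == c.values[0]));
  }
}
```

Once the codes are fixed in a .dataset-style file, the data itself needs nothing but numbers, which is why stripping an ARFF header leaves something CSV-like.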
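Lance's suggestion above could be sketched, in plain-Java terms, like this. It deliberately avoids the real Hadoop SequenceFile API and uses a DataOutputStream instead, so the layout (a header string, a record count, the records, then a trailer string with stats built during writing) is an assumption for illustration only, not Mahout's or Hadoop's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Plain-java analog of a metadata comment at the start and end of a file.
public class MetadataFramesSketch {

  public static byte[] write(String headerJson, double[] records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF(headerJson);            // the "classic metadata header"
    out.writeInt(records.length);
    double sum = 0;
    for (double r : records) {
      out.writeDouble(r);
      sum += r;                          // stats accumulated during writing
    }
    // Trailer with stats that were only known once writing finished.
    out.writeUTF("{\"count\": " + records.length + ", \"sum\": " + sum + "}");
    out.flush();
    return bytes.toByteArray();
  }

  // Reads back just the header and trailer, skipping the records themselves.
  public static String[] readMetadata(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    String header = in.readUTF();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      in.readDouble();
    }
    String trailer = in.readUTF();
    return new String[] {header, trailer};
  }

  public static void main(String[] args) throws IOException {
    byte[] data = write("{\"knobs\": \"defaults\"}", new double[] {1.0, 2.0, 3.0});
    String[] meta = readMetadata(data);
    System.out.println("header:  " + meta[0]);
    System.out.println("trailer: " + meta[1]);
  }
}
```

The same record format sits between the two metadata frames, which is the point: existing readers of the payload would only need to learn to skip the extra header and trailer objects.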
On Wed, Aug 24, 2011 at 4:22 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Aug 24, 2011, at 4:04 PM, Ted Dunning wrote:
>
>> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>>
>>> My initial inclination is -1 on adding a GPL dependency.
>>
>> If it were maven accessible, I might wish to temper your opinion, but I
>> think I agree.
>
> I don't think that is even enough. It has to be completely optional, and even
> then it needs PMC and legal. See http://www.apache.org/legal/3party.html#options
>
>>> Can you spell out exactly what is meant by needing a "general input format"
>>> and "general transfer format"?
>>
>> Well, at the least, it would be useful to be able to retain the distinction
>> between different fields. I would like to be able to have multiple fields,
>> each with a particular type of data (categorical, continuous, word-like or
>> text-like).
>>
>>> We currently take in raw text, and then vectorize it. Are Vectors (with
>>> either hashed encoding, or with a dictionary file) not suitable as a
>>> format for some reason?
>>
>> Text is insufficient because it can't really represent fielded data, nor
>> continuous variables. This is basically the same argument that led Lucene
>> to have something more than just text.
>>
>> You can abuse text in many ways, but it isn't very satisfactory.
>>
>> Vectors are insufficient because I can't retain the fielded nature of the
>> input. I would like to have a feature sharding system use some fields but
>> not others, or even use some values in some fields, but not others. I
>> certainly can't do that once I have used a hashed encoding.
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com

--
Lance Norskog
[email protected]
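A sketch of the kind of fielded record Ted is describing, where each field keeps its name and type until the moment of vectorization so that a sharding step can choose which fields to encode. Everything here, the Field type and the hash-based encoder, is hypothetical and not an existing Mahout API; the point is only that field identity must survive until encoding, because after hashed encoding the selection is impossible.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical fielded record: named, typed fields, selectively vectorized.
public class FieldedRecordSketch {

  enum Type { CATEGORICAL, CONTINUOUS, TEXT }

  static final class Field {
    final Type type;
    final Object value;
    Field(Type type, Object value) { this.type = type; this.value = value; }
  }

  final Map<String, Field> fields = new LinkedHashMap<>();

  void put(String name, Type type, Object value) {
    fields.put(name, new Field(type, value));
  }

  // Encode only the named fields into a fixed-size hashed vector. Because
  // the record is still fielded at this point, skipping a field is trivial.
  double[] encode(Set<String> useFields, int dim) {
    double[] v = new double[dim];
    for (Map.Entry<String, Field> e : fields.entrySet()) {
      if (!useFields.contains(e.getKey())) continue;  // sharding: drop field
      Field f = e.getValue();
      int slot = Math.floorMod((e.getKey() + "=" + f.value).hashCode(), dim);
      v[slot] += (f.type == Type.CONTINUOUS) ? (Double) f.value : 1.0;
    }
    return v;
  }

  public static void main(String[] args) {
    FieldedRecordSketch r = new FieldedRecordSketch();
    r.put("color", Type.CATEGORICAL, "red");
    r.put("weight", Type.CONTINUOUS, 2.5);
    r.put("desc", Type.TEXT, "small red widget");
    // Use only two of the three fields; "desc" never reaches the encoder.
    double[] v = r.encode(Set.of("color", "weight"), 16);
    double sum = 0;
    for (double x : v) sum += x;
    System.out.println("encoded mass = " + sum);
  }
}
```

Plain text cannot carry the type tags, and a bare Vector cannot carry the field names, which is the gap Ted is pointing at between the two existing formats.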
