I don't know how relevant it is, but Mahout's Decision Forests already has some form of Instance. It was inspired by Weka (because the original Random Forests implementation was written there) but was adapted and simplified for Mahout.
The code still uses Vector(s) to store that data via an Instance class (essentially a Vector with a "label" and a unique "id"). It can store both continuous and categorical attributes (categorical values are converted to integers). A special "Dataset" class identifies the nature of each attribute and converts categorical attributes to their integer counterparts. This class carries roughly the same information as an ARFF header; in fact, if you remove the ARFF header, the remaining data is similar to CSV and Mahout DF loads it just fine. The Dataset can also be used to "ignore" some of the attributes when loading the data.

One problem I found with this representation is where to store the Dataset information. In my case I wrote a "Describe" tool that goes through the whole dataset (or a subset that contains all the categorical values) and generates a .dataset file, which is loaded along with the data.

Keep in mind, though, that those classes were written specifically to be used by Decision Forests.

---------- Forwarded message ----------
From: Lance Norskog <[email protected]>
Date: Thu, Aug 25, 2011 at 5:29 AM
Subject: Re: discussion of input conversions
To: [email protected]

I would like a comment Writable at the beginning and/or end of a SequenceFile. They could be just a StringWritable storing JSON. The one at the beginning would be the classic metadata header; the one at the end might hold stats about the vectors the file stores, built during writing.

For classification outputs in particular, I would like to know the tuning knobs used, the raw classification outputs, and the confusion matrix, especially if I can ask for any or all of them to be saved. This would still be the same SequenceFile format as before, just with metadata objects.
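To make the Instance/Dataset representation described earlier in this message concrete, here is a minimal plain-Java sketch. This is not the actual Mahout API; the class and method names are made up for illustration. The idea is just that a dataset descriptor assigns each categorical value an integer code (the kind of pass a "Describe" tool would make), so every instance can be stored as a plain double vector plus a label and an id.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the Instance/Dataset idea; NOT the Mahout classes.
public class DatasetSketch {

  // One value-to-integer dictionary per categorical attribute index.
  private final Map<Integer, Map<String, Integer>> dictionaries = new LinkedHashMap<>();

  // Returns the integer code for a categorical value, assigning a new
  // code the first time the value is seen.
  public int encode(int attribute, String value) {
    Map<String, Integer> dict =
        dictionaries.computeIfAbsent(attribute, a -> new LinkedHashMap<>());
    return dict.computeIfAbsent(value, v -> dict.size());
  }

  // An instance is just a vector of doubles plus a label and a unique id.
  public static final class Instance {
    final int id;
    final String label;
    final double[] values;
    Instance(int id, String label, double[] values) {
      this.id = id;
      this.label = label;
      this.values = values;
    }
  }

  public static void main(String[] args) {
    DatasetSketch dataset = new DatasetSketch();
    // Attribute 0 is categorical ("red"/"blue"), attribute 1 is continuous.
    Instance a = new Instance(0, "yes", new double[] {dataset.encode(0, "red"), 1.5});
    Instance b = new Instance(1, "no",  new double[] {dataset.encode(0, "blue"), 2.0});
    Instance c = new Instance(2, "yes", new double[] {dataset.encode(0, "red"), 0.7});
    // "red" always maps to the same code; distinct values get distinct codes.
    System.out.println("red -> " + (int) a.values[0] + ", blue -> " + (int) b.values[0]);
    System.out.println("same code for repeated value: " + (a.values[0] == c.values[0]));
  }
}
```

Once the codes are fixed in a .dataset-style file, the data itself needs nothing but numbers, which is why stripping an ARFF header leaves something CSV-like.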
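Lance's suggestion above could be sketched, in plain-Java terms, like this. It deliberately avoids the real Hadoop SequenceFile API and uses a DataOutputStream instead, so the layout (a header string, a record count, the records, then a trailer string with stats built during writing) is an assumption for illustration only, not Mahout's or Hadoop's actual format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Plain-java analog of a metadata comment at the start and end of a file.
public class MetadataFramesSketch {

  public static byte[] write(String headerJson, double[] records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF(headerJson);            // the "classic metadata header"
    out.writeInt(records.length);
    double sum = 0;
    for (double r : records) {
      out.writeDouble(r);
      sum += r;                          // stats accumulated during writing
    }
    // Trailer with stats that were only known once writing finished.
    out.writeUTF("{\"count\": " + records.length + ", \"sum\": " + sum + "}");
    out.flush();
    return bytes.toByteArray();
  }

  // Reads back just the header and trailer, skipping the records themselves.
  public static String[] readMetadata(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    String header = in.readUTF();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      in.readDouble();
    }
    String trailer = in.readUTF();
    return new String[] {header, trailer};
  }

  public static void main(String[] args) throws IOException {
    byte[] data = write("{\"knobs\": \"defaults\"}", new double[] {1.0, 2.0, 3.0});
    String[] meta = readMetadata(data);
    System.out.println("header:  " + meta[0]);
    System.out.println("trailer: " + meta[1]);
  }
}
```

The same record format sits between the two metadata frames, which is the point: existing readers of the payload would only need to learn to skip the extra header and trailer objects.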
On Wed, Aug 24, 2011 at 4:22 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Aug 24, 2011, at 4:04 PM, Ted Dunning wrote:
>
>> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>>
>>> My initial inclination is -1 on adding a GPL dependency.
>>
>> If it were maven accessible, I might wish to temper your opinion, but I
>> think I agree.
>
> I don't think that is even enough. It has to be completely optional, and even
> then it needs PMC and legal. See http://www.apache.org/legal/3party.html#options
>
>>> Can you spell out exactly what is meant by needing a "general input format"
>>> and "general transfer format"?
>>
>> Well, at the least, it would be useful to be able to retain the distinction
>> between different fields. I would like to be able to have multiple fields,
>> each with a particular type of data (categorical, continuous, word-like or
>> text-like).
>>
>>> We currently take in raw text, and then vectorize it. Are Vectors (with
>>> either hashed encoding, or with a dictionary file) not suitable as a
>>> format for some reason?
>>
>> Text is insufficient because it can't really represent fielded data, nor
>> continuous variables. This is basically the same argument that led Lucene
>> to have something more than just text.
>>
>> You can abuse text in many ways, but it isn't very satisfactory.
>>
>> Vectors are insufficient because I can't retain the fielded nature of the
>> input. I would like to have a feature sharding system use some fields but
>> not others, or even use some values in some fields, but not others. I
>> certainly can't do that once I have used a hashed encoding.
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com

--
Lance Norskog
[email protected]
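A sketch of the kind of fielded record Ted is describing, where each field keeps its name and type until the moment of vectorization so that a sharding step can choose which fields to encode. Everything here, the Field type and the hash-based encoder, is hypothetical and not an existing Mahout API; the point is only that field identity must survive until encoding, because after hashed encoding the selection is impossible.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical fielded record: named, typed fields, selectively vectorized.
public class FieldedRecordSketch {

  enum Type { CATEGORICAL, CONTINUOUS, TEXT }

  static final class Field {
    final Type type;
    final Object value;
    Field(Type type, Object value) { this.type = type; this.value = value; }
  }

  final Map<String, Field> fields = new LinkedHashMap<>();

  void put(String name, Type type, Object value) {
    fields.put(name, new Field(type, value));
  }

  // Encode only the named fields into a fixed-size hashed vector. Because
  // the record is still fielded at this point, skipping a field is trivial.
  double[] encode(Set<String> useFields, int dim) {
    double[] v = new double[dim];
    for (Map.Entry<String, Field> e : fields.entrySet()) {
      if (!useFields.contains(e.getKey())) continue;  // sharding: drop field
      Field f = e.getValue();
      int slot = Math.floorMod((e.getKey() + "=" + f.value).hashCode(), dim);
      v[slot] += (f.type == Type.CONTINUOUS) ? (Double) f.value : 1.0;
    }
    return v;
  }

  public static void main(String[] args) {
    FieldedRecordSketch r = new FieldedRecordSketch();
    r.put("color", Type.CATEGORICAL, "red");
    r.put("weight", Type.CONTINUOUS, 2.5);
    r.put("desc", Type.TEXT, "small red widget");
    // Use only two of the three fields; "desc" never reaches the encoder.
    double[] v = r.encode(Set.of("color", "weight"), 16);
    double sum = 0;
    for (double x : v) sum += x;
    System.out.println("encoded mass = " + sum);
  }
}
```

Plain text cannot carry the type tags, and a bare Vector cannot carry the field names, which is the gap Ted is pointing at between the two existing formats.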
