I would like a comment Writable at the beginning and/or end of a SequenceFile. Each could be just a StringWritable storing JSON. The one at the beginning would be the classic metadata header. The one at the end might hold stats about the vectors the file stores, built up during writing.
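
To make the header/trailer idea concrete, here is a minimal sketch in plain Python. It does not use the Hadoop SequenceFile API: a list of JSON strings stands in for the file, and the names write_with_metadata and read_metadata and the particular stats fields (count, max_dim, nonzeros) are assumptions chosen only for illustration.

```python
import json


def write_with_metadata(vectors, header):
    """Return a list of records: JSON header, the vectors, JSON trailer.

    Hypothetical stand-in for a SequenceFile; not the Hadoop API.
    """
    # The header is known up front, so it is written first.
    records = [json.dumps({"kind": "header", **header}, sort_keys=True)]

    # Stats are accumulated while the vectors are being written.
    count = 0
    max_dim = 0
    nonzeros = 0
    for vec in vectors:
        records.append(json.dumps({"kind": "vector", "values": vec}))
        count += 1
        max_dim = max(max_dim, len(vec))
        nonzeros += sum(1 for v in vec if v != 0)

    # The trailer can only be written last, once every vector has been seen.
    trailer = {
        "kind": "trailer",
        "count": count,
        "max_dim": max_dim,
        "nonzeros": nonzeros,
    }
    records.append(json.dumps(trailer, sort_keys=True))
    return records


def read_metadata(records):
    """Return (header, trailer) dicts; the data records are not parsed."""
    header = json.loads(records[0])
    trailer = json.loads(records[-1])
    return header, trailer
```

A reader that only wants the metadata looks at the first and last records, and a reader that only wants vectors can skip any record whose kind is not "vector", so the data records themselves stay unchanged.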
For classification outputs in particular, I would like to know the tuning knobs used, the raw classification outputs and the confusion matrix, ideally with the option to ask for any or all of them to be saved. This would still be the same SequenceFile format as before, just with metadata objects.

On Wed, Aug 24, 2011 at 4:22 PM, Grant Ingersoll <[email protected]> wrote:
>
> On Aug 24, 2011, at 4:04 PM, Ted Dunning wrote:
>
>> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
>>
>>> My initial inclination is -1 on adding a GPL dependency.
>>>
>>
>> If it were maven accessible, I might wish to temper your opinion, but I
>> think I agree.
>
> I don't think that is even enough.  It has to be completely optional and even
> then it needs PMC and legal.  See
> http://www.apache.org/legal/3party.html#options
>
>>
>>
>>> Can you spell out exactly what is meant by needing a "general input format"
>>> and "general transfer format".
>>
>>
>> Well, at the least, it would be useful to be able to retain the distinction
>> between different fields.  I would like to be able to have multiple fields,
>> each with a particular type of data (categorical, continuous, word-like or
>> text-like).
>>
>>> We currently take in raw text, and then vectorize it.  Are Vectors (with
>>> either hashed encoding, or with a dictionary file) not suitable as a
>>> format for some reason?
>>>
>>
>> Text is insufficient because it can't really represent fielded data, nor
>> continuous variables.  This is basically the same argument that led Lucene
>> to have something more than just text.
>>
>> You can abuse text in many ways, but it isn't very satisfactory.
>>
>> Vectors are insufficient because I can't retain the fielded nature of the
>> input.  I would like to have a feature sharding system use some fields but
>> not others or even use some values in some fields, but not others.  I
>> certainly can't do that once I have used a hashed encoding.
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
>

--
Lance Norskog
[email protected]
