On Aug 24, 2011, at 4:04 PM, Ted Dunning wrote:

> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix wrote:
>
>> My initial inclination is -1 on adding a GPL dependency.
>>
>
> If it were maven accessible, I might wish to temper your opinion, but I
> think I agree.

I don't think that is even enough. It has to be completely optional and even then it needs PMC and legal. See http://www.apache.org/legal/3party.html#options

>
>> Can you spell out exactly what is meant by needing a "general input format"
>> and "general transfer format".
>
> Well, at the least, it would be useful to be able to retain the distinction
> between different fields. I would like to be able to have multiple fields,
> each with a particular type of data (categorical, continuous, word-like or
> text-like).
>
>> We currently take in raw text, and then vectorize it. Are Vectors (with
>> either hashed encoding, or with a dictionary file) not suitable as a format
>> for some reason?
>
> Text is insufficient because it can't really represent fielded data, nor
> continuous variables. This is basically the same argument that led Lucene
> to have something more than just text.
>
> You can abuse text in many ways, but it isn't very satisfactory.
>
> Vectors are insufficient because I can't retain the fielded nature of the
> input. I would like to have a feature sharding system use some fields but
> not others or even use some values in some fields, but not others. I
> certainly can't do that once I have used a hashed encoding.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
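
To make the fielded-input point in the quoted text concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the Field class, the select_fields and hash_encode helpers, the CRC32 bucketing, and the vector size of 16 are hypothetical and are not Mahout's API or any format proposed in this thread. It only shows that a fielded record can be filtered by field name before encoding, while the hashed vector that comes out no longer carries field names at all.

```python
import zlib
from dataclasses import dataclass
from typing import Iterable, List, Union

# Hypothetical field kinds, named after the data types mentioned in the thread.
FIELD_KINDS = {"categorical", "continuous", "word-like", "text-like"}


@dataclass(frozen=True)
class Field:
    """One named, typed value in a record (illustrative only)."""
    name: str
    kind: str
    value: Union[str, float]

    def __post_init__(self):
        if self.kind not in FIELD_KINDS:
            raise ValueError(f"unknown field kind: {self.kind}")


def _bucket(key: str, size: int) -> int:
    # CRC32 is stable across runs, unlike Python's salted built-in hash().
    return zlib.crc32(key.encode("utf-8")) % size


def select_fields(record: Iterable[Field], names: Iterable[str]) -> List[Field]:
    """Keep only the named fields; possible because names are still present."""
    wanted = set(names)
    return [f for f in record if f.name in wanted]


def hash_encode(record: Iterable[Field], size: int = 16) -> List[float]:
    """Fold a record into a fixed-size vector; field identity is lost here."""
    vec = [0.0] * size
    for f in record:
        if f.kind == "continuous":
            # Continuous values add their magnitude to the field's bucket.
            vec[_bucket(f.name, size)] += float(f.value)
        elif f.kind == "text-like":
            # Text is split on whitespace; each token gets its own bucket.
            for token in str(f.value).split():
                vec[_bucket(f"{f.name}={token}", size)] += 1.0
        else:
            # Categorical and word-like values are hashed as name=value.
            vec[_bucket(f"{f.name}={f.value}", size)] += 1.0
    return vec


record = [
    Field("country", "categorical", "US"),
    Field("age", "continuous", 42.0),
    Field("body", "text-like", "hashed vectors lose field names"),
]

# Field selection has to happen while the record is still fielded.
subset = select_fields(record, ["country", "age"])
vector = hash_encode(subset)
```

Once hash_encode has run, the result is just a list of floats, so there is no way to ask it for "everything except body". Any per-field choice has to be made on the record first.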
