On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
> My initial inclination is -1 on adding a GPL dependency.

If it were Maven accessible, I might wish to temper your opinion, but I think I agree.

> Can you spell out exactly what is meant by needing a "general input format"
> and "general transfer format".

Well, at the least, it would be useful to be able to retain the distinction between different fields. I would like to be able to have multiple fields, each with a particular type of data (categorical, continuous, word-like or text-like).

> We currently take in raw text, and then vectorize it. Are Vectors (with
> either hashed encoding, or with a dictionary file) not suitable as a format for some reason?

Text is insufficient because it can't really represent fielded data, nor continuous variables. This is basically the same argument that led Lucene to have something more than just text. You can abuse text in many ways, but it isn't very satisfactory.

Vectors are insufficient because I can't retain the fielded nature of the input. I would like to have a feature sharding system use some fields but not others, or even use some values in some fields but not others. I certainly can't do that once I have used a hashed encoding.
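
To make the fielded-versus-hashed distinction concrete, here is a minimal sketch in Python. It is not Mahout code: `FieldType`, `Field`, `select_fields` and `hash_encode` are all hypothetical names invented for illustration, and the CRC32-based hashing is only a stand-in for whatever hashed encoding is actually used. It shows that a fielded record can still be filtered by field name, while the hashed vector built from it no longer says which field produced which slot.

```python
import zlib
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    # The four kinds of data mentioned in the message (hypothetical enum).
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"
    WORD_LIKE = "word-like"
    TEXT_LIKE = "text-like"


@dataclass(frozen=True)
class Field:
    # One named, typed value in a record (hypothetical structure).
    name: str
    type: FieldType
    value: object


def select_fields(record, keep):
    """Keep only the fields whose names are in `keep`.

    This is the per-field selection a sharding system would need, and it
    is only possible while the record is still fielded.
    """
    return [f for f in record if f.name in keep]


def hash_encode(record, num_slots=16):
    """Hash a fielded record into a fixed-size vector (illustrative only).

    After this step the field names are gone: all that remains is a list
    of floats, and distinct fields may collide in the same slot.
    """
    vec = [0.0] * num_slots
    for f in record:
        if f.type is FieldType.CONTINUOUS:
            # Continuous: one slot per field name, weighted by the value.
            slot = zlib.crc32(f.name.encode()) % num_slots
            vec[slot] += float(f.value)
        elif f.type is FieldType.TEXT_LIKE:
            # Text-like: one count per whitespace-separated token.
            for tok in str(f.value).split():
                slot = zlib.crc32(f"{f.name}={tok}".encode()) % num_slots
                vec[slot] += 1.0
        else:
            # Categorical and word-like: one count per (name, value) pair.
            slot = zlib.crc32(f"{f.name}={f.value}".encode()) % num_slots
            vec[slot] += 1.0
    return vec


record = [
    Field("country", FieldType.CATEGORICAL, "US"),
    Field("age", FieldType.CONTINUOUS, 34.0),
    Field("subject", FieldType.TEXT_LIKE, "free shipping today"),
]

# Field-level selection works on the fielded form ...
subset = select_fields(record, {"country", "age"})

# ... but the hashed form is just numbers with no field labels attached.
vector = hash_encode(record)
```

Dropping the `subject` field is a one-line filter on `record`, whereas removing its contribution from `vector` would require re-hashing from the original fields, which is the reason for wanting a fielded transfer format ahead of vectorization.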
