On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:

> My initial inclination is -1 on adding a GPL dependency.
>

If it were Maven accessible, I might wish to temper your opinion, but I
think I agree.


> Can you spell out exactly what is meant by needing a "general input format"
> and "general transfer format".


Well, at the least, it would be useful to be able to retain the distinction
between different fields.  I would like to be able to have multiple fields,
each with a particular type of data (categorical, continuous, word-like or
text-like).
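
A minimal sketch of what such a fielded record might look like, in Python. The names `FieldType`, `Field` and `make_record` are hypothetical and only illustrate the idea; they are not part of any existing Mahout API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Union


class FieldType(Enum):
    # The four kinds of data mentioned above (hypothetical enum).
    CATEGORICAL = "categorical"
    CONTINUOUS = "continuous"
    WORD_LIKE = "word-like"
    TEXT_LIKE = "text-like"


@dataclass(frozen=True)
class Field:
    name: str
    type: FieldType
    value: Union[str, float]


def make_record(*fields: Field) -> Dict[str, Field]:
    """Index fields by name so each one keeps its own type and value."""
    return {f.name: f for f in fields}


record = make_record(
    Field("country", FieldType.CATEGORICAL, "NZ"),
    Field("price", FieldType.CONTINUOUS, 12.5),
    Field("title", FieldType.WORD_LIKE, "mahout"),
    Field("body", FieldType.TEXT_LIKE, "a longer free-text description"),
)
```

The point of the sketch is only that the field name and type travel with the value, so a later stage can still decide how to encode each field.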

> We currently take in raw text, and then vectorize it.  Are Vectors (with
> either hashed encoding, or with a dictionary file) not suitable as a format
> for some reason?
>

Text is insufficient because it can't really represent fielded data, nor
continuous variables.  This is basically the same argument that led Lucene
to have something more than just text.

You can abuse text in many ways, but it isn't very satisfactory.

Vectors are insufficient because I can't retain the fielded nature of the
input.  I would like to have a feature sharding system use some fields but
not others, or even use some values in some fields but not others.  I
certainly can't do that once I have used a hashed encoding.
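
To illustrate the point, here is a rough Python sketch of hashed encoding applied to a fielded record. The function `hashed_encode`, the example field names and the vector size of 16 are all assumptions made for illustration, not Mahout's actual encoders. Field selection is only possible before hashing, while the field names are still present.

```python
import zlib
from typing import Dict, List, Set


def hashed_encode(record: Dict[str, str], fields: Set[str], size: int = 16) -> List[float]:
    """Hash the selected fields of a record into a fixed-size vector."""
    vec = [0.0] * size
    for name, value in record.items():
        if name not in fields:
            continue  # per-field selection needs the field name, so it happens here
        # crc32 is used only because it is deterministic across runs.
        index = zlib.crc32(f"{name}={value}".encode("utf-8")) % size
        vec[index] += 1.0
    return vec


record = {"country": "NZ", "color": "red", "title": "mahout"}

# Two shards, each seeing a different subset of fields.
shard_a = hashed_encode(record, {"country", "title"})
shard_b = hashed_encode(record, {"color"})

# Encoding everything at once: the result is just 16 numbers, with no
# record of which field contributed to which slot.
full = hashed_encode(record, set(record))
```

Given only `full`, there is no way to recover `shard_a` or `shard_b`; the split has to be made while the input is still fielded.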
