On Aug 24, 2011, at 4:04 PM, Ted Dunning wrote:

> On Wed, Aug 24, 2011 at 3:16 PM, Jake Mannix <[email protected]> wrote:
> 
>> My initial inclination is -1 on adding a GPL dependency.
>> 
> 
> If it were maven accessible, I might wish to temper your opinion, but I
> think I agree.

I don't think that is even enough.  It has to be completely optional and even 
then it needs PMC and legal.  See 
http://www.apache.org/legal/3party.html#options

> 
> 
>> Can you spell out exactly what is meant by needing a "general input format"
>> and "general transfer format".
> 
> 
> Well, at the least, it would be useful to be able to retain the distinction
> between different fields.  I would like to be able to have multiple fields,
> each with a particular type of data (categorical, continuous, word-like or
> text-like).
> 
>> We currently take in raw text, and then vectorize it.  Are Vectors (with
>> either hashed encoding, or with a dictionary file) not suitable as a format
>> for some reason?
>> 
> 
> Text is insufficient because it can't really represent fielded data, nor
> continuous variables.  This is basically the same argument that led Lucene
> to have something more than just text.
> 
> You can abuse text in many ways, but it isn't very satisfactory.
> 
> Vectors are insufficient because I can't retain the fielded nature of the
> input.  I would like to have a feature sharding system use some fields but
> not others or even use some values in some fields, but not others.  I
> certainly can't do that once I have used a hashed encoding.
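
To make the quoted point concrete, here is a minimal sketch of why field
selection stops being possible after hashed encoding. Everything in it is an
illustrative assumption: the names hash_encode and select_fields, the example
record, and the bucket count are hypothetical and are not Mahout APIs.

```python
import zlib


def hash_encode(record, num_buckets=16):
    """Hash every field of a record into one fixed-size vector.

    Hypothetical helper for illustration only, not a Mahout encoder.
    Numeric fields add their value to one bucket; other fields add 1.0
    per whitespace-separated token.
    """
    vector = [0.0] * num_buckets
    for field, value in record.items():
        if isinstance(value, (int, float)):
            idx = zlib.crc32(field.encode("utf-8")) % num_buckets
            vector[idx] += float(value)
        else:
            for token in str(value).split():
                key = f"{field}={token}".encode("utf-8")
                vector[zlib.crc32(key) % num_buckets] += 1.0
    return vector


def select_fields(record, wanted):
    """Keep only the named fields; only possible while fields still exist."""
    return {f: v for f, v in record.items() if f in wanted}


# A fielded record: text-like, categorical, and continuous values.
record = {"title": "cheap flights", "category": "travel", "price": 199.0}

# Field selection works on the fielded form.
subset = select_fields(record, {"title", "price"})

# After hashing, the result is a bare list of floats with no field names,
# so there is nothing left to select on.
hashed = hash_encode(record)
```

The fielded dict can still be filtered per field or per value, while the
hashed list carries only bucket indices, which is the distinction the quoted
text is asking a transfer format to preserve.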

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
