On Nov 4, 2011, at 12:33 PM, Ted Dunning wrote:

> Yes. Bulk encoding makes some sense if you can lock down some of the
> design options.
>
> In common with other encodings, there is the entire question of
> segmentation. The hashed encoding adds some additional flexibility. This
> includes:
>
> 1) What is the final dimensionality?
>
> 2) How many probes?
>
> 3) Does the data have fields? Should we care?
>
> 4) What about numerical fields?
>
> The 20 newsgroups and email corpora are good examples of the third case.
> Most of our examples assume that these are just text and that symbols in
> any field mean the same thing. This might not be true. For instance, an
> email address might mean something different in a header field than in body
> text, where it would indicate a response to a posting. Another case: words
> in subject lines are subtly different from words in text bodies.
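
To make sure I'm reading those four knobs right, here is roughly how I picture them falling out with the existing encoders. Just a sketch: I'm assuming the classes under org.apache.mahout.vectorizer.encoders, and the field names and the 10k dimensionality are placeholders:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class EncodingChoices {
  public static void main(String[] args) {
    // 1) the final dimensionality is fixed by the vector we hash into
    Vector v = new RandomAccessSparseVector(10000);      // placeholder size

    // 3) one encoder per field, so "apache" in a subject and "apache" in a
    //    body hash to different locations
    StaticWordValueEncoder subject = new StaticWordValueEncoder("subject");
    StaticWordValueEncoder body = new StaticWordValueEncoder("body");

    // 2) extra probes trade vector density for fewer collision problems
    subject.setProbes(2);
    body.setProbes(2);

    // 4) numeric fields go through a continuous encoder; the value is
    //    passed as a string and parsed by the encoder
    ContinuousValueEncoder length = new ContinuousValueEncoder("length");

    subject.addToVector("hashing", v);
    body.addToVector("hashing", v);
    length.addToVector("1834", v);

    System.out.println(v.getNumNondefaultElements() + " non-zero features");
  }
}

For the "should we care about fields" case, collapsing everything onto a single encoder name would reproduce the field-blind behavior our current examples assume.
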
I'm going to start w/ the basics: key + blob of text, run through the
LuceneTextValueEncoder. I have a feeling we will be able to abstract out a
Vectorizer framework that is more conducive to all of these things. I'll put
a rough sketch of the driver I have in mind below the quoted thread.

> On Fri, Nov 4, 2011 at 8:31 AM, Grant Ingersoll <[email protected]> wrote:
>
>> I'm still wrapping my head around the workflow for SGD, so bear with me.
>>
>> We've got SparseVectorsFromSequenceFiles, which can take the output of
>> SequenceFilesFromDirectory, etc. (i.e. a SequenceFile w/ one document per
>> key/value pair) and convert it to sparse vectors. Does it make sense to
>> have a similar class for bulk encoding lots of data for SGD using the
>> FeatureVectorEncoder stuff? Then we could also use the SplitInput class
>> to split into training and test sets, and a simple driver program could
>> run over it and do the train/test. The added benefit is that the
>> toolchains start to match our other processes, at least examples-wise.
>>
>> I would suppose that such an implementation would use the Lucene encoder
>> stuff.
>>
>> Is this reasonable?
>>
>> -Grant
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
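
Here is the kind of driver I have in mind for the key + blob case. Only a sketch: it assumes the encoder API under org.apache.mahout.vectorizer.encoders (including LuceneTextValueEncoder.setAnalyzer()), a SequenceFile whose key starts with the category the way SequenceFilesFromDirectory lays out 20 newsgroups, and 20 target categories; the SplitInput step and the held-out test pass are left out:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.LuceneTextValueEncoder;

public class BulkEncodeAndTrain {

  private static final int FEATURES = 10000;   // placeholder dimensionality

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    LuceneTextValueEncoder text = new LuceneTextValueEncoder("body");
    text.setAnalyzer(new StandardAnalyzer(Version.LUCENE_31));
    ConstantValueEncoder bias = new ConstantValueEncoder("intercept");

    Map<String, Integer> labels = new HashMap<String, Integer>();
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(20, FEATURES, new L1());

    // key = doc id (assumed to look like "/<category>/<file>"),
    // value = the blob of text
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {
      Vector v = new RandomAccessSparseVector(FEATURES);
      bias.addToVector("", 1, v);
      text.addToVector(value.toString(), v);

      String category = key.toString().split("/")[1];
      if (!labels.containsKey(category)) {
        labels.put(category, labels.size());
      }
      learner.train(labels.get(category), v);
    }
    reader.close();
  }
}

If that shape holds up, the Vectorizer abstraction is mostly a matter of pulling the encode step (everything between reading the key/value and calling train) out of the driver so the same code can feed SplitInput and the test pass.
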
