Yes.  Bulk encoding makes some sense if you can lock down some of the
design options.

In common with other encodings, there is the entire question of
segmentation.  Hashed encoding also adds some additional flexibility,
which raises a few design questions (a rough sketch follows the list):

1) what is the final dimensionality?

2) how many probes?

3) does the data have fields?  Should we care?

4) what about numerical fields?
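
To make (1), (2), and (4) concrete, here is a rough sketch using the
encoder classes as I understand them (assuming the FeatureVectorEncoder
family in org.apache.mahout.vectorizer.encoders and RandomAccessSparseVector;
the cardinality of 1000 and probe count of 2 are illustrative only, not
recommendations):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncodingSketch {
      public static void main(String[] args) {
        // (1) final dimensionality: fixed up front; everything hashes into it
        Vector v = new RandomAccessSparseVector(1000);

        // word-like features; the encoder's name seeds the hash
        StaticWordValueEncoder words = new StaticWordValueEncoder("body");

        // (2) probes: each feature sets several hashed locations, trading a
        // little extra density for fewer destructive collisions
        words.setProbes(2);

        for (String word : "the quick brown fox".split(" ")) {
          words.addToVector(word, v);
        }

        // (4) a numerical field goes through a continuous encoder rather
        // than being hashed as if it were a word
        ContinuousValueEncoder age = new ContinuousValueEncoder("age");
        age.addToVector("42", v);

        System.out.println(v);
      }
    }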

The 20 newsgroups and email corpora are good examples of the third case.
Most of our examples assume that the documents are just text and that a
symbol means the same thing regardless of which field it appears in.  That
might not be true.  For instance, an email address might mean something
different in a header field than in body text, where it would indicate a
response to a posting.  Similarly, words in subject lines are subtly
different from words in message bodies.
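
If fields do turn out to matter, one cheap way to keep them distinct with
the hashed encoders is to use a separate encoder per field, since (as I
understand it) the encoder name is folded into the hash seed.  A
hypothetical sketch along those lines (the field names and sample strings
are made up):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

    public class PerFieldEncodingSketch {
      public static void main(String[] args) {
        Vector v = new RandomAccessSparseVector(1000);

        // separate encoders per field: the encoder name is part of the
        // hash, so "meeting" in a subject lands in different locations
        // than "meeting" in a body
        TextValueEncoder subject = new TextValueEncoder("subject");
        TextValueEncoder body = new TextValueEncoder("body");

        subject.addToVector("meeting tomorrow", v);
        body.addToVector("the meeting is moved to tomorrow", v);

        System.out.println(v);
      }
    }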



On Fri, Nov 4, 2011 at 8:31 AM, Grant Ingersoll <[email protected]> wrote:

> I'm still wrapping my head around workflow for SGD, so bear with me.
>
> We've got SparseVectorsFromSequenceFiles which can take the output of
> SequenceFilesFromDirectory, etc. (i.e. Seq File w/ 1 document per line) and
> convert to sparse vectors.  Does it make sense to have a similar class for
> bulk encoding lots of data for SGD using the FeatureVectorEncoder stuff?
>  Seems then, we could also use the SplitInput class to split into training
> and test.   Then, a simple driver program could run over it and do the
> train/test.  The added benefit is the toolchains start to match our other
> processes, at least examples-wise.
>
> I would suppose that such an implementation would use the Lucene encoder
> stuff.
>
> Is this reasonable?
>
> -Grant
