On Nov 4, 2011, at 12:33 PM, Ted Dunning wrote:

> Yes. Bulk encoding makes some sense if you can lock down some of the
> design options.
>
> In common with other encodings, there is the entire question of
> segmentation. The hashed encoding adds some additional flexibility. This
> includes:
>
> 1) What is the final dimensionality?
>
> 2) How many probes?
>
> 3) Does the data have fields? Should we care?
>
> 4) What about numerical fields?
>
> The 20 newsgroups and email corpora are good examples of the third case.
> Most of our examples assume that these are just text and that symbols in
> any field mean the same thing. This might not be true. For instance, an
> email address might mean something different in a header field than in body
> text, where it would indicate a response to a posting. Another case: words
> in subject lines are subtly different from words in text bodies.
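
To make sure I'm reading those four knobs right, here is roughly how I picture them falling out with the existing encoders. Just a sketch: I'm assuming the classes under org.apache.mahout.vectorizer.encoders, and the field names and the 10k dimensionality are placeholders:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class EncodingChoices {
  public static void main(String[] args) {
    // 1) the final dimensionality is fixed by the vector we hash into
    Vector v = new RandomAccessSparseVector(10000);      // placeholder size

    // 3) one encoder per field, so "apache" in a subject and "apache" in a
    //    body hash to different locations
    StaticWordValueEncoder subject = new StaticWordValueEncoder("subject");
    StaticWordValueEncoder body = new StaticWordValueEncoder("body");

    // 2) extra probes trade vector density for fewer collision problems
    subject.setProbes(2);
    body.setProbes(2);

    // 4) numeric fields go through a continuous encoder; the value is
    //    passed as a string and parsed by the encoder
    ContinuousValueEncoder length = new ContinuousValueEncoder("length");

    subject.addToVector("hashing", v);
    body.addToVector("hashing", v);
    length.addToVector("1834", v);

    System.out.println(v.getNumNondefaultElements() + " non-zero features");
  }
}

For the "should we care about fields" case, collapsing everything onto a single encoder name would reproduce the field-blind behavior our current examples assume.
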
I'm going to start w/ the basics: key + blob of text, run through the
LuceneTextValueEncoder. I have a feeling we will be able to abstract out a
Vectorizer framework that is more conducive to all of these things. I'll put
a rough sketch of the driver I have in mind below the quoted thread.

> On Fri, Nov 4, 2011 at 8:31 AM, Grant Ingersoll <[email protected]> wrote:
>
>> I'm still wrapping my head around the workflow for SGD, so bear with me.
>>
>> We've got SparseVectorsFromSequenceFiles, which can take the output of
>> SequenceFilesFromDirectory, etc. (i.e. a SequenceFile w/ one document per
>> key/value pair) and convert it to sparse vectors. Does it make sense to
>> have a similar class for bulk encoding lots of data for SGD using the
>> FeatureVectorEncoder stuff? Then we could also use the SplitInput class
>> to split into training and test sets, and a simple driver program could
>> run over it and do the train/test. The added benefit is that the
>> toolchains start to match our other processes, at least examples-wise.
>>
>> I would suppose that such an implementation would use the Lucene encoder
>> stuff.
>>
>> Is this reasonable?
>>
>> -Grant
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
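
Here is the kind of driver I have in mind for the key + blob case. Only a sketch: it assumes the encoder API under org.apache.mahout.vectorizer.encoders (including LuceneTextValueEncoder.setAnalyzer()), a SequenceFile whose key starts with the category the way SequenceFilesFromDirectory lays out 20 newsgroups, and 20 target categories; the SplitInput step and the held-out test pass are left out:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.LuceneTextValueEncoder;

public class BulkEncodeAndTrain {

  private static final int FEATURES = 10000;   // placeholder dimensionality

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    LuceneTextValueEncoder text = new LuceneTextValueEncoder("body");
    text.setAnalyzer(new StandardAnalyzer(Version.LUCENE_31));
    ConstantValueEncoder bias = new ConstantValueEncoder("intercept");

    Map<String, Integer> labels = new HashMap<String, Integer>();
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(20, FEATURES, new L1());

    // key = doc id (assumed to look like "/<category>/<file>"),
    // value = the blob of text
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {
      Vector v = new RandomAccessSparseVector(FEATURES);
      bias.addToVector("", 1, v);
      text.addToVector(value.toString(), v);

      String category = key.toString().split("/")[1];
      if (!labels.containsKey(category)) {
        labels.put(category, labels.size());
      }
      learner.train(labels.get(category), v);
    }
    reader.close();
  }
}

If that shape holds up, the Vectorizer abstraction is mostly a matter of pulling the encode step (everything between reading the key/value and calling train) out of the driver so the same code can feed SplitInput and the test pass.
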
