I'm still wrapping my head around the workflow for SGD, so bear with me.

We've got SparseVectorsFromSequenceFiles, which can take the output of 
SequenceFilesFromDirectory, etc. (i.e. a SequenceFile with one document per 
line) and convert it to sparse vectors.  Does it make sense to have a similar 
class that bulk-encodes lots of data for SGD using the FeatureVectorEncoder 
stuff?  It seems we could then also use the SplitInput class to split the 
output into training and test sets, and a simple driver program could run over 
it to do the train/test.  The added benefit is that the toolchains start to 
match our other processes, at least examples-wise.

I would suppose that such an implementation would use the Lucene encoder stuff.
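For anyone less familiar with the FeatureVectorEncoder approach: the core idea is the hashing trick, where each token is hashed directly into a slot of a fixed-width vector, so no dictionary pass over the corpus is needed (unlike the SparseVectorsFromSequenceFiles path). A minimal sketch of that idea, in Python rather than Mahout's actual Java API, with the vector represented as a plain index-to-weight dict (all names here are illustrative, not Mahout's):

```python
import re
import zlib

def encode(text, dims=1000):
    """Hash each token into a fixed-width sparse vector, represented
    as a dict mapping slot index -> accumulated weight.  This is a
    rough stand-in for what a FeatureVectorEncoder-style bulk encoder
    would do per document; it is not Mahout code."""
    vec = {}
    for token in re.findall(r"\w+", text.lower()):
        # crc32 is used only because it is deterministic across runs;
        # a real encoder would use its own hash (and often probes
        # multiple slots to soften collisions).
        idx = zlib.crc32(token.encode("utf-8")) % dims
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec
```

Since encoding each document is independent of every other document, a bulk-encoding job over a SequenceFile of documents parallelizes trivially, which is what makes the proposed class a natural fit alongside the existing vectorizer tools.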

Is this reasonable?

-Grant
