I'm still wrapping my head around the workflow for SGD, so bear with me. We've got SparseVectorsFromSequenceFiles, which can take the output of SequenceFilesFromDirectory, etc. (i.e. a SequenceFile with one document per entry) and convert it to sparse vectors. Does it make sense to have a similar class for bulk encoding lots of data for SGD using the FeatureVectorEncoder stuff? It seems we could then also use the SplitInput class to split the encoded data into training and test sets, and a simple driver program could run over those and do the train/test. The added benefit is that the toolchain starts to match our other processes, at least examples-wise.
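To make that concrete, here's a rough sketch of the kind of bulk encoder I'm picturing. The class name is made up, and the encoder package/API details are from memory, so treat this as pseudocode rather than a patch:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

/**
 * Hypothetical analog of SparseVectorsFromSequenceFiles for the SGD path:
 * reads <Text docId, Text docText> pairs (e.g. SequenceFilesFromDirectory
 * output) and writes <Text docId, VectorWritable> pairs that SplitInput
 * could then carve into training and test sets.
 */
public class EncodedVectorsFromSequenceFiles {

  private static final int CARDINALITY = 100000; // hashed feature space size

  public static void main(String[] args) throws IOException {
    Path input = new Path(args[0]);
    Path output = new Path(args[1]);
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    FeatureVectorEncoder wordEncoder = new StaticWordValueEncoder("body");
    FeatureVectorEncoder biasEncoder = new ConstantValueEncoder("intercept");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, output, Text.class, VectorWritable.class);
    try {
      Text docId = new Text();
      Text docText = new Text();
      while (reader.next(docId, docText)) {
        Vector v = new RandomAccessSparseVector(CARDINALITY);
        biasEncoder.addToVector("", 1, v);
        // dumb whitespace tokenization for now; this is where the
        // Lucene question below comes in
        for (String token : docText.toString().split("\\s+")) {
          wordEncoder.addToVector(token, v);
        }
        writer.append(docId, new VectorWritable(v));
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

The output would be <Text, VectorWritable> sequence files, which is the same general shape our other vectorization tooling produces.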
I would assume such an implementation would use the Lucene encoder stuff (i.e. LuceneTextValueEncoder). Is that reasonable?

-Grant
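P.S. By "the Lucene encoder stuff" I mean something along these lines, swapping the whitespace split above for a real Analyzer. Again, I'm guessing at the exact API; LuceneTextValueEncoder and setAnalyzer are the pieces I have in mind:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.LuceneTextValueEncoder;

public class LuceneEncodeSnippet {
  public static Vector encode(String docText) {
    // Lucene-backed text encoder: the Analyzer does the tokenization,
    // and each resulting token gets hashed into the vector.
    LuceneTextValueEncoder encoder = new LuceneTextValueEncoder("body");
    encoder.setAnalyzer(new StandardAnalyzer(Version.LUCENE_30));

    Vector v = new RandomAccessSparseVector(100000);
    encoder.addToVector(docText, v);
    return v;
  }
}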
