On Mon, Jun 7, 2010 at 3:02 AM, Ted Dunning <[email protected]> wrote:
> Drew, I especially would like to hear what you think about how this
> would relate to the Avro document stuff you did.

At first read of your description, it seems we could consider implementing a csv -> avro structured document mapping, and then modify the vectorization/learning code to take avro structured documents (asd) as input. Users could develop their own processors to convert from their format to asd, or perhaps something in the vein of Solr's DataImportHandler could be used as a general tool to load from databases or XML into asd. At a lower level, this makes your proposed Lucene IndexWriter interface implementation very attractive.

I am a little skeptical about the utility of Avro for storing the vectors themselves. Some very early tests suggested that using Avro reflection to derive a schema from an existing class (such as one of Mahout's vector classes) did not produce a large win in terms of performance or space, but more work in that direction still needs to happen.

I'll take a look at how the data loading relates to the rest of the code in your patch and come back with questions. The approach to vectorization sounds like a pretty neat idea. I'm also interested in seeing how the vector + trace to human-readable dump code works.

Drew
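To make the csv -> structured document mapping concrete, here is a minimal sketch. It uses only Python's stdlib `csv` module and plain dicts in place of real Avro records, so everything here (function name, sample field names) is hypothetical; an actual implementation would emit Avro records conforming to a declared schema.

```python
import csv
import io

def csv_to_documents(csv_text):
    """Map each CSV row to a dict-shaped 'structured document'.

    This only illustrates the per-row mapping step; a real
    implementation would produce Avro records against a schema
    rather than plain dicts.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

sample = "id,text\n1,hello world\n2,mahout avro"
docs = csv_to_documents(sample)
print(docs[0])  # {'id': '1', 'text': 'hello world'}
```

The same per-row mapping is where a user-supplied processor (or a DataImportHandler-style loader) would plug in for other source formats.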
