2010/1/19 Robin Anil <robin.a...@gmail.com>:
> I like this idea very much. More like adding metadata over sparse vectors
>
> To the ideo make it more verbose
> Vectors currently have a name. Which is the id of the original document/data
> point the vector points to ? It could also have fields or labels in which
> the vector belong to.
>
> Question is, is there any other metadata we want to associate with a Vector
> ?

I am starting to write an implementation for the newly created https://issues.apache.org/jira/browse/MAHOUT-262. I will attach it as a patch in an hour or two (writing some tests first).

Right now I am serializing the label indices outside the Vectors, as a prefix to the data stream, either as a simple int[] or as a single int. Storing class labels as strings might not be as efficient: the same small set of class labels is repeated across all instances, so the data would be very redundant, with an impact on the dataset size and hence on the performance of the processing algorithms.

Most state-of-the-art sequential classification algorithms are simple linear model variants that mostly compute dot products and are thus IO bound. I think it is therefore important to have packed representations of the vectorized dataset, to keep the classification mapper from being IO starved and wasting CPU cycles.

Maybe further discussion should happen as comments on MAHOUT-262.

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
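
P.S. To make the prefix layout described above concrete, here is a minimal sketch of what "label indices as a prefix to the data stream" could look like. It is not the MAHOUT-262 patch and it does not use Mahout's Vector classes: the class name LabeledVectorCodec, the Record holder, and the exact field order (label count, label indices, entry count, then index/value pairs) are all assumptions made for illustration. Only java.io from the JDK is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper, NOT part of Mahout: packs integer label indices as a
// prefix in front of a sparse vector stored as (index, value) pairs.
public class LabeledVectorCodec {

    /** Decoded record: label indices plus the sparse vector entries. */
    public static final class Record {
        public final int[] labels;
        public final int[] indices;
        public final double[] values;

        Record(int[] labels, int[] indices, double[] values) {
            this.labels = labels;
            this.indices = indices;
            this.values = values;
        }
    }

    /** Assumed layout: label count, labels, entry count, then (index, value) pairs. */
    public static void write(DataOutputStream out, int[] labels,
                             int[] indices, double[] values) throws IOException {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        out.writeInt(labels.length);
        for (int label : labels) {
            out.writeInt(label);
        }
        out.writeInt(indices.length);
        for (int i = 0; i < indices.length; i++) {
            out.writeInt(indices[i]);
            out.writeDouble(values[i]);
        }
    }

    /** Reads back one record written by write(). */
    public static Record read(DataInputStream in) throws IOException {
        int[] labels = new int[in.readInt()];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = in.readInt();
        }
        int n = in.readInt();
        int[] indices = new int[n];
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            indices[i] = in.readInt();
            values[i] = in.readDouble();
        }
        return new Record(labels, indices, values);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // One instance with a single class label (index 2) and two non-zero features.
        write(out, new int[] {2}, new int[] {0, 7}, new double[] {1.5, -3.0});
        out.flush();
        Record r = read(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(bytes.size() + " bytes, label " + r.labels[0]);
    }
}
```

With this layout a single-label instance pays 8 bytes for its label (count plus index) regardless of how long the label name is; the mapping from label index to label string would have to be stored once, elsewhere, which is also an assumption of this sketch.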