PS: let's see a patch so we can keep discussing. I'm seeing good ideas on a lot of topics here, and I want to strike while the iron is hot and continue overhauling this.
But making everything a named vector is a step back toward what we just agreed to move away from: making the name a default part of all vectors. I am also not sure it is practical to use only VectorWritable, because of the storage overhead, though it does seem to offer the very facility alluded to in the talk of a 'facade' class.

I think writing optional data in Hadoop's basic serialization format is not really possible. The attempts I saw in the previous code felt fragile: read a string; if it is a class name, assume it is the name of the vector class to deserialize; otherwise assume it is a vector name... hmm.

So are we on the same page about how this works now? In fact I would expect to see implementations start to specialize to one particular representation, where possible, to be more efficient.

On this topic, sort of:
- How about moving label bindings out to NamedVector?
- How about a similar restructuring of matrices?
- And how about not writing "org.apache.mahout.math.RandomAccessSparseVectorWritable" every time VectorWritable does its wrapping? I think making the package name and "Writable" implicit is perhaps worth the loss of generality.
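
To make the two serialization points concrete, here is a rough sketch, not what the current code does. It assumes we would be willing to spend one byte on a presence flag, so the reader never has to guess whether a string is a class name or a vector name, and that only the short class name is written, with the package prefix and the "Writable" suffix added back on read. The class VectorHeaderSketch and its encode/decode methods are made up for illustration; it uses plain java.io streams rather than Hadoop's Writable so that it runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch only; this is NOT Mahout's actual serialized format.
final class VectorHeaderSketch {

    // Assumption: every wrapped class lives in this package and ends in "Writable".
    static final String PACKAGE_PREFIX = "org.apache.mahout.math.";
    static final String WRITABLE_SUFFIX = "Writable";

    // Writes the short class name, then a presence flag, then the optional vector name.
    static byte[] encode(String shortClassName, String vectorName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(shortClassName);          // e.g. "RandomAccessSparseVector"
            out.writeBoolean(vectorName != null);  // explicit flag instead of guessing
            if (vectorName != null) {
                out.writeUTF(vectorName);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {full Writable class name, vector name or null}.
    static String[] decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            String writableClassName = PACKAGE_PREFIX + in.readUTF() + WRITABLE_SUFFIX;
            String vectorName = in.readBoolean() ? in.readUTF() : null;
            return new String[] {writableClassName, vectorName};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] header = decode(encode("RandomAccessSparseVector", "doc-42"));
        System.out.println(header[0] + " / " + header[1]);
    }
}
```

The trade-off is the one mentioned above: a class outside the assumed package could no longer be wrapped, in exchange for a much shorter header on every record.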