Hey gang, I'm working on getting MAHOUT-206 (splitting SparseVector into its two primary specialized forms - map-based and array-based) and MAHOUT-205 (pulling Writable out of the math package) finished up, but in digging into the unit tests and the places where Vectors get used as Writable things, I've run into either some annoyance at how Hadoop deals with serialization, or else a misunderstanding on my part of how to use Writables properly when you want to play nicely with inheritance and clean abstraction.
If you've got a SequenceFile, you'd like it to be a SequenceFile<IntWritable, Vector>, not one tied to some fixed subclass, because you'd like the serialization technique and storage to be decoupled from the algorithms using that data (for example, your algorithm shouldn't care whether it gets a SparseVector or a DenseVector - it may be optimal for one case over the other, but that's another story). What is the right way to do this with Writables? From what I can tell, SequenceFile.Writer#append(Object key, Object value) (why on earth does it take Objects? shouldn't these be Writables?) does an explicit check that key.getClass() == this.keyClass and value.getClass() == this.valueClass, which won't do any subclass matching (and so will fail if value.getClass() is DenseVector.class and valueClass is SparseVector.class, or just Vector.class).

To avoid this kind of mess, it seems the proper approach in MAHOUT-205 would be to have one overall VectorWritable class which can serialize/deserialize all Vector implementations. Right? This is how I've generally looked at Writables - they tend to be only very loosely object-oriented, in the sense that they are typically just wrappers around some data object and provide marshalling/unmarshalling capabilities for that object (but the Writable itself rarely (ever?) implements any useful interface the held object implements - when you want the object, you call Writable.get() to fetch the inner guy).

Of course, while writing out generically typed vectors is easy without knowing internals (comparing numNonDefaultElements() with size() tells you whether it's sparse or not, and in fact you could optimize further by switching to a dense representation whenever numNonDefaultElements() is greater than about size()/2), reading them back in and choosing which Vector class to instantiate is a pain: you either need to move all the write(DataOutput) and readFields(DataInput) methods from the vector implementations into the new VectorWritable and have one big switch statement deciding which one to call (roughly the sketch in the P.S. below), or else you need a Writable subclass of each and every concrete Vector implementation which has those methods (and have to go back and make all non-transient fields protected instead of private, so the subclass can properly serialize that data out) - and even this effectively has the big switch somewhere. My default feeling is that the latter technique is the way to go, but it still looks a little ugly. Or is there a better way to do this?

What I really think is necessary, as an end goal, is for us to be able to spit out int + Vector key-value pairs from mappers and reducers without needing to know which kind they are inside the mapper or reducer (because you may get them from doing someMatrix.times(someVector), in which case all you know is that you have a Vector), as well as to go in the other direction (so you can read a SequenceFile<IntWritable, VectorWritable> and just pop out Vector instances).

-jake
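
P.S. Here's a very rough sketch of what I mean by the first option - one all-in-one VectorWritable with the "big switch" hidden inside readFields(). None of these names are settled (the flag, get()/set(), which concrete classes get instantiated on the read side, even what the non-default-count accessor is actually called are all just placeholders), and I'm writing against the current SparseVector/DenseVector pair purely for illustration:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.Writable;
  // plus imports for Vector, DenseVector, SparseVector from wherever the
  // math package ends up living after MAHOUT-205

  public class VectorWritable implements Writable {

    private Vector vector;

    public VectorWritable() {
    }

    public VectorWritable(Vector vector) {
      this.vector = vector;
    }

    public Vector get() {
      return vector;
    }

    public void set(Vector vector) {
      this.vector = vector;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      int size = vector.size();
      // crude density heuristic from above: go dense once more than about
      // half the entries are non-default (name of the accessor is whatever
      // the Vector interface actually calls it)
      int nonDefault = vector.numNonDefaultElements();
      boolean dense = nonDefault > size / 2;
      out.writeBoolean(dense);
      out.writeInt(size);
      if (dense) {
        for (int i = 0; i < size; i++) {
          out.writeDouble(vector.get(i));
        }
      } else {
        // assumes "non-default" means non-zero; write (index, value) pairs
        out.writeInt(nonDefault);
        for (int i = 0; i < size; i++) {
          double value = vector.get(i);
          if (value != 0.0) {
            out.writeInt(i);
            out.writeDouble(value);
          }
        }
      }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      // the "big switch" lives here, in one place, instead of being
      // scattered across Writable subclasses of every Vector impl
      boolean dense = in.readBoolean();
      int size = in.readInt();
      if (dense) {
        Vector v = new DenseVector(size);
        for (int i = 0; i < size; i++) {
          v.set(i, in.readDouble());
        }
        vector = v;
      } else {
        Vector v = new SparseVector(size);
        int nonDefault = in.readInt();
        for (int i = 0; i < nonDefault; i++) {
          v.set(in.readInt(), in.readDouble());
        }
        vector = v;
      }
    }
  }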
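
And this is what I'd want the mapper/reducer side to feel like - again just a sketch against the old-style mapred API; TimesMapper and how someMatrix gets loaded are made up, the point is only that the map body never mentions a concrete Vector class:

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  // plus imports for Matrix, Vector, and the VectorWritable sketched above

  public class TimesMapper extends MapReduceBase
      implements Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    private Matrix someMatrix; // loaded in configure(), however that ends up working

    @Override
    public void map(IntWritable row, VectorWritable value,
                    OutputCollector<IntWritable, VectorWritable> output,
                    Reporter reporter) throws IOException {
      Vector v = value.get();              // no idea (or care) whether it's sparse or dense
      Vector result = someMatrix.times(v); // all we know is that we got back a Vector
      output.collect(row, new VectorWritable(result));
    }
  }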