Hey gang,

  I'm working on getting MAHOUT-206 (splitting SparseVector into the two
primary specialized forms - map-based and array-based) and MAHOUT-205
(pulling Writable out of the math package) finished up, but in digging into
the unit tests and the places where Vectors are used as Writables, I've run
into either some annoyance at how Hadoop deals with serialization, or else
my own misunderstanding of how one goes about using Writables properly when
you want to play nicely with inheritance and clean abstraction.

  If you've got a SequenceFile, you'd like it to be a
SequenceFile<IntWritable, Vector>, not one tied to some fixed subclass,
because you'd like the serialization technique and storage to be decoupled
from the algorithms using such a set of data (for example, your algorithm
shouldn't care whether it's handed a SparseVector or a DenseVector - it may
be optimal for one case over the other, but that's another story).  What is
the right way to do this with Writables?
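
  Concretely, what I mean by the algorithm not caring is something like
this (treat the exact Vector method names and package as approximate):

    import org.apache.mahout.math.Vector;

    public final class VectorOps {
      private VectorOps() { }

      // Behaves identically whether v is a SparseVector or a DenseVector -
      // the caller never needs to know which it got.
      public static double sum(Vector v) {
        double total = 0.0;
        for (int i = 0; i < v.size(); i++) {
          total += v.get(i);
        }
        return total;
      }
    }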

  From what I can tell, SequenceFile.Writer#append(Object key, Object value)
(why on earth is it taking Objects?  Shouldn't these be Writables?) does an
explicit check that key.getClass() == this.keyClass and value.getClass() ==
this.valueClass, which won't do any subclass matching (and so will fail if
value.getClass() is DenseVector.class and valueClass is SparseVector.class,
or just Vector.class).
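
  The check in question looks roughly like this (paraphrased from the
behavior, not copied from the Hadoop source):

    // Inside SequenceFile.Writer, roughly:
    public synchronized void append(Object key, Object val) throws IOException {
      if (key.getClass() != keyClass) {
        throw new IOException("wrong key class: " + key.getClass().getName()
            + " is not " + keyClass);
      }
      if (val.getClass() != valClass) {
        throw new IOException("wrong value class: " + val.getClass().getName()
            + " is not " + valClass);
      }
      // ... serialize key and value into the file ...
    }
    // Because the comparison is == on the runtime class, a DenseVector value
    // gets rejected by a writer created with SparseVector.class (or even
    // Vector.class).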

  To avoid this kind of mess, it seems the proper approach in MAHOUT-205
would be to have one overall VectorWritable class which can
serialize/deserialize all Vector implementations.  Right?  This is how I've
generally looked at Writables - they tend to be only loosely object
oriented, in the sense that they are typically just wrappers around some
data object and provide marshalling/unmarshalling capabilities for said
object (but the Writable itself rarely (ever?) actually implements any
useful interface the held object implements - when you want said object,
you call Writable.get() to fetch the inner guy).
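
  So the shape I have in mind is the usual wrapper pattern, something like
this skeleton (the serialization details are sketched further below):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;
    import org.apache.mahout.math.Vector;

    public class VectorWritable implements Writable {

      private Vector vector;

      public VectorWritable() { }           // no-arg constructor for Hadoop

      public VectorWritable(Vector vector) {
        this.vector = vector;
      }

      public Vector get() {                 // fetch the inner guy
        return vector;
      }

      public void set(Vector vector) {
        this.vector = vector;
      }

      public void write(DataOutput out) throws IOException {
        // write a type tag plus the vector's contents (sketched below)
      }

      public void readFields(DataInput in) throws IOException {
        // read the type tag, instantiate the right Vector impl, fill it in
      }
    }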

  Of course, while writing out generically classed vectors is easy without
knowing internals (comparing numNonDefaultElements() to size() tells you
whether it's effectively sparse or dense, and in fact you could optimize
this further by switching to a dense representation whenever
numNonDefaultElements() is greater than about size()/2), reading in and
choosing which vector class to instantiate is a pain - you need to either
move all the write(DataOutput) and readFields(DataInput) methods from the
vector implementations into the new VectorWritable and have a big switch
statement deciding which one to call, or else you need a Writable subclass
of each and every concrete vector implementation which has said methods
(and go back and make all non-transient fields protected instead of
private, so the subclass can properly serialize out said data) - and even
this effectively has the big switch somewhere.  My default feeling is that
the latter technique is the way to go, but it still looks a little ugly.
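
  For what it's worth, the first option (everything inside VectorWritable,
dispatching on a type tag) would come out roughly like this - the concrete
class names, constructors, and the element encoding are just placeholders,
and this assumes DenseVector/SparseVector are imported alongside the
skeleton above:

    private static final byte DENSE = 0;
    private static final byte SPARSE = 1;

    public void write(DataOutput out) throws IOException {
      int size = vector.size();
      int nonZero = 0;
      for (int i = 0; i < size; i++) {
        if (vector.get(i) != 0.0) {
          nonZero++;
        }
      }
      boolean dense = nonZero > size / 2;     // the density heuristic from above
      out.writeByte(dense ? DENSE : SPARSE);
      out.writeInt(size);
      if (dense) {
        for (int i = 0; i < size; i++) {
          out.writeDouble(vector.get(i));
        }
      } else {
        out.writeInt(nonZero);
        for (int i = 0; i < size; i++) {
          double value = vector.get(i);
          if (value != 0.0) {                 // (index, value) pairs only
            out.writeInt(i);
            out.writeDouble(value);
          }
        }
      }
    }

    public void readFields(DataInput in) throws IOException {
      byte type = in.readByte();
      int size = in.readInt();
      switch (type) {                         // the unavoidable big switch
        case DENSE:
          vector = new DenseVector(size);
          for (int i = 0; i < size; i++) {
            vector.set(i, in.readDouble());
          }
          break;
        case SPARSE:
          vector = new SparseVector(size);
          int nonZero = in.readInt();
          for (int k = 0; k < nonZero; k++) {
            vector.set(in.readInt(), in.readDouble());
          }
          break;
        default:
          throw new IOException("unknown vector type tag: " + type);
      }
    }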

  Or is there a better way to do this?  What I really think is necessary,
as an end goal, is for us to be able to spit out int + Vector key-value
pairs from mappers and reducers without needing to know which kind of
Vector they are inside the mapper or reducer (because you may get them from
doing someMatrix.times(someVector), in which case all you know is that you
have a Vector), and to go in the other direction as well (so you can read a
SequenceFile<IntWritable, VectorWritable> and just pop out some Vector
instances).
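
  In other words, the end state I'm after is one where a reducer can be
written like this (just a sketch against the 0.20 mapreduce API, using
whatever VectorWritable ends up looking like; Vector.plus() stands in for
any operation that only needs the interface):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.Vector;

    public class VectorSumReducer
        extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {

      protected void reduce(IntWritable row, Iterable<VectorWritable> values,
                            Context ctx) throws IOException, InterruptedException {
        Vector sum = null;
        for (VectorWritable vw : values) {
          Vector v = vw.get();       // all we know (or care) is that it's a Vector
          sum = (sum == null) ? v : sum.plus(v);
        }
        if (sum != null) {
          ctx.write(row, new VectorWritable(sum));
        }
      }
    }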

  -jake
