FWIW, 
http://hadoop.markmail.org/message/jr4cbem46erlhgzu?q=gsingers+from:%22Grant+Ingersoll%22
 got no response.

Totally agree on everything, so if you can make it work, +1!  I think up until
now, we basically took the "let's punt" approach.  I would definitely like
users, in 99% of cases, never to have to think about which vector
implementation they are using.

Perhaps it might be worth delving into Hadoop at a slightly lower level to see
whether there is anything that can be done there.  Of course, that could be a
rat's nest.

-Grant


On Jan 5, 2010, at 4:18 AM, Jake Mannix wrote:

> Hey gang,
> 
>  I'm working on getting MAHOUT-206 (splitting SparseVector into the two
> primary specialized forms - map-based and array-based) and MAHOUT-205
> (pulling Writable out of the math package) finished up, but in digging into
> the unit tests and usages of Vectors as Writable thingees, I've come upon
> either some annoyance at how Hadoop deals with serialization, or else a
> misunderstanding of how one goes about using Writables properly when you
> want to play nicely with inheritance and nice abstraction.
> 
>  If you've got a SequenceFile, you'd like to have it be a
> SequenceFile<IntWritable, Vector>, not some fixed subclass, because you'd
> like the serialization technique and storage to be decoupled from the
> algorithms using such a set of data (for example, your algorithm shouldn't
> care whether there's a SparseVector or a DenseVector - it may be optimal for
> one case over the other, but that's another story).  What is the right way
> to do this with Writables?
> 
>  From what I can tell, in SequenceFile.Writer#append(Object key, Object
> value) (why on earth is it taking Objects?  shouldn't these be Writables?),
> it does an explicit check of key.getClass() == this.keyClass and
> value.getClass() == this.valueClass, which won't do any subclass matching
> (and so will fail if value.getClass() is DenseVector.class, and valueClass
> is SparseVector.class, or just Vector.class).
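
To illustrate the check described above, here is a minimal sketch of why an
exact getClass() comparison rejects subclasses while Class#isInstance would
accept them. The Vector, DenseVector and SparseVector types below are
hypothetical stand-ins, not the Mahout classes, and this is not the
SequenceFile.Writer source, only the comparison pattern the message describes.

```java
public class ClassCheckSketch {
    // Hypothetical stand-ins for the Mahout types; not the real classes.
    interface Vector {}
    static class DenseVector implements Vector {}
    static class SparseVector implements Vector {}

    // Strict check: the runtime class must be exactly the declared class.
    static boolean exactMatch(Class<?> declared, Object value) {
        return value.getClass() == declared;
    }

    // Lenient check: any subclass or implementation of the declared class passes.
    static boolean subclassMatch(Class<?> declared, Object value) {
        return declared.isInstance(value);
    }

    public static void main(String[] args) {
        Object dense = new DenseVector();
        System.out.println(exactMatch(Vector.class, dense));       // false
        System.out.println(exactMatch(SparseVector.class, dense)); // false
        System.out.println(exactMatch(DenseVector.class, dense));  // true
        System.out.println(subclassMatch(Vector.class, dense));    // true
    }
}
```

With the strict form, declaring the value class as Vector only works if every
value written is exactly of class Vector, which an interface can never satisfy.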
> 
>  To avoid this kind of mess, it seems the proper approach in MAHOUT-205
> would be to have one overall VectorWritable class, which can
> serialize/deserialize all Vector implementations.  Right?  This is how I've
> in general looked at Writables - they tend very much to be very loosely
> object oriented, in the sense that they are typically just wrappers around
> some data object, and provide marshalling/unmarshalling capabilities for
> said object (but the Writable itself rarely (ever?) actually also implements
> any useful interface the held object implements - when you want said object,
> you Writable.get() on it to fetch the inner guy).
> 
>  Of course, while writing out generically classed vectors is easy without
> knowing internals (numNonDefaultElements() == size() tells you whether it's
> sparse or not, and in fact you could optimize this further by saying that if
> numNonDefaultElements is greater than about size()/2, then switch to a Dense
> representation), reading in and choosing which vector class to instantiate
> is a pain - you need to either move all the write(DataOutput) and
> readFields(DataInput) methods from the vector implementations into the new
> VectorWritable, and have a big switch statement deciding which one to call,
> or else you need Writable subclasses of each and every concrete vector
> implementation which has said methods (and go back and make all nontransient
> fields protected instead of private, so the subclass can properly serialize
> out said data) - and even this has the big switch effectively, somewhere.
> My default feeling is the latter technique is the way to go, but it still
> looks a little ugly.
> 
>  Or is there a better way to do this?  What I really think is necessary, as
> an end-goal, is for us to be able to spit out int + Vector key-value pairs
> from mappers and reducers, and not need to know which kind they are in the
> mapper or reducer (because you may get them from doing
> someMatrix.times(someVector), in which case all you know is that you have a
> Vector), as well as do the other direction (so you can read a
> SequenceFile<IntWritable, VectorWritable> and just pop out some Vector
> instances).
> 
>  -jake
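
For reference, a minimal sketch of the single-wrapper idea from the quoted
message: one VectorWritable that writes a type tag, picks dense or sparse
encoding with the size()/2 heuristic, and switches on the tag when reading.
Everything here is an assumption for illustration: the Vector, DenseVector and
SparseVector types are simplified stand-ins rather than the Mahout classes, the
tag values and byte layout are made up, and the wrapper only mirrors the shape
of Hadoop's Writable (write(DataOutput) / readFields(DataInput)) using plain
java.io so it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

public class VectorWritableSketch {
    // Hypothetical, minimal stand-in for a vector interface.
    interface Vector {
        int size();
        double get(int index);
    }

    static final class DenseVector implements Vector {
        final double[] values;
        DenseVector(double[] values) { this.values = values; }
        public int size() { return values.length; }
        public double get(int i) { return values[i]; }
    }

    static final class SparseVector implements Vector {
        final int size;
        final Map<Integer, Double> entries = new TreeMap<>();
        SparseVector(int size) { this.size = size; }
        void set(int i, double v) { entries.put(i, v); }
        public int size() { return size; }
        public double get(int i) { return entries.getOrDefault(i, 0.0); }
    }

    // One wrapper for every Vector implementation.
    static final class VectorWritable {
        private static final byte DENSE = 0;   // assumed tag values
        private static final byte SPARSE = 1;
        private Vector vector;

        VectorWritable() {}
        VectorWritable(Vector vector) { this.vector = vector; }
        Vector get() { return vector; }

        void write(DataOutput out) throws IOException {
            int nonZero = 0;
            for (int i = 0; i < vector.size(); i++) {
                if (vector.get(i) != 0.0) nonZero++;
            }
            // Heuristic from the message: more than about half full => dense.
            boolean dense = nonZero > vector.size() / 2;
            out.writeByte(dense ? DENSE : SPARSE);
            out.writeInt(vector.size());
            if (dense) {
                for (int i = 0; i < vector.size(); i++) out.writeDouble(vector.get(i));
            } else {
                out.writeInt(nonZero);
                for (int i = 0; i < vector.size(); i++) {
                    double v = vector.get(i);
                    if (v != 0.0) { out.writeInt(i); out.writeDouble(v); }
                }
            }
        }

        // The "big switch": the tag decides which implementation to build.
        void readFields(DataInput in) throws IOException {
            byte tag = in.readByte();
            int size = in.readInt();
            if (tag == DENSE) {
                double[] values = new double[size];
                for (int i = 0; i < size; i++) values[i] = in.readDouble();
                vector = new DenseVector(values);
            } else if (tag == SPARSE) {
                SparseVector sparse = new SparseVector(size);
                int count = in.readInt();
                for (int k = 0; k < count; k++) {
                    int index = in.readInt();
                    double value = in.readDouble();
                    sparse.set(index, value);
                }
                vector = sparse;
            } else {
                throw new IOException("Unknown vector tag: " + tag);
            }
        }
    }

    // Serialize to bytes and read back; the caller only ever sees a Vector.
    static Vector roundTrip(Vector v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new VectorWritable(v).write(new DataOutputStream(bytes));
            VectorWritable w = new VectorWritable();
            w.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return w.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that in this sketch the writer chooses the encoding from the data, so a
mostly-full SparseVector comes back as a DenseVector. That is fine as long as
callers only depend on the Vector interface, which is the decoupling the
message asks for.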
