FWIW, http://hadoop.markmail.org/message/jr4cbem46erlhgzu?q=gsingers+from:%22Grant+Ingersoll%22 got no response.
Totally agree on everything, so if you can make it work, +1! I think up until now, we basically took the "let's punt" approach. I would definitely like to remove the need for a user, in 99% of cases, to ever think about which vector implementation they are using. Perhaps it would be worth delving into Hadoop at a slightly lower level to see whether anything can be done there. Of course, that could be a rat's nest.

-Grant

On Jan 5, 2010, at 4:18 AM, Jake Mannix wrote:

> Hey gang,
>
> I'm working on getting MAHOUT-206 (splitting SparseVector into the two
> primary specialized forms - map-based and array-based), and MAHOUT-205
> (pulling Writable out of the math package) finished up, but in digging into
> the unit tests and usages of Vectors as Writable thingees, I've come upon
> either some annoyance at how Hadoop deals with serialization, or else a
> misunderstanding of how one goes about using Writables properly when you
> want to play nicely with inheritance and nice abstraction.
>
> If you've got a SequenceFile, you'd like to have it be a
> SequenceFile<IntWritable, Vector>, not some fixed subclass, because you'd
> like the serialization technique and storage to be decoupled from the
> algorithms using such a set of data (for example, your algorithm shouldn't
> care whether there's a SparseVector or a DenseVector - it may be optimal for
> one case over the other, but that's another story).  What is the right way
> to do this with Writables?
>
> From what I can tell, in SequenceFile.Writer#append(Object key, Object
> value) (why on earth is it taking Objects? shouldn't these be Writables?),
> it does an explicit check of key.getClass() == this.keyClass and
> value.getClass() == this.valueClass, which won't do any subclass matching
> (and so will fail if value.getClass() is DenseVector.class, and valueClass
> is SparseVector.class, or just Vector.class).
>
> To avoid this kind of mess, it seems the proper approach in MAHOUT-205
> would be to have one overall VectorWritable class, which can
> serialize/deserialize all Vector implementations.  Right?  This is how I've
> in general looked at Writables - they tend to be only very loosely
> object oriented, in the sense that they are typically just wrappers around
> some data object, and provide marshalling/unmarshalling capabilities for
> said object (but the Writable itself rarely (ever?) actually also implements
> any useful interface the held object implements - when you want said object,
> you call get() on the Writable to fetch the inner guy).
>
> Of course, while writing out generically classed vectors is easy without
> knowing internals (numNonDefaultElements() == size() tells you whether it's
> sparse or not, and in fact you could optimize this further by saying that if
> numNonDefaultElements() is greater than about size()/2, then switch to a
> Dense representation), reading in and choosing which vector class to
> instantiate is a pain - you need to either move all the write(DataOutput) and
> readFields(DataInput) methods from the vector implementations into the new
> VectorWritable, and have a big switch statement deciding which one to call,
> or else you need Writable subclasses of each and every concrete vector
> implementation which has said methods (and go back and make all non-transient
> fields protected instead of private, so the subclass can properly serialize
> out said data) - and even this has the big switch effectively, somewhere.
> My default feeling is the latter technique is the way to go, but it still
> looks a little ugly.
>
> Or is there a better way to do this?  What I really think is necessary, as
> an end-goal, is for us to be able to spit out int + Vector key-value pairs
> from mappers and reducers, and not need to know which kind they are in the
> mapper or reducer (because you may get them from doing
> someMatrix.times(someVector), in which case all you know is that you have a
> Vector), as well as do the other direction (so you can read a
> SequenceFile<IntWritable, VectorWritable> and just pop out some Vector
> instances).
>
> -jake
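
For illustration, the single-wrapper idea from the quoted message could be sketched roughly as below. This is not Mahout's or Hadoop's code: Vec, DenseVec, SparseVec, VecWritable and VecWritableDemo are hypothetical stand-ins, and only java.io is used, so the write(DataOutput)/readFields(DataInput) pair merely mirrors the shape of Hadoop's Writable interface. The sketch assumes a default element value of 0.0, picks the encoding by density (more than size()/2 non-default elements means dense), and keeps the "big switch" on the read side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a Vector interface (not the real Mahout API).
interface Vec {
    int size();
    double get(int index);
}

// Hypothetical array-backed implementation.
final class DenseVec implements Vec {
    private final double[] values;
    DenseVec(double[] values) { this.values = values.clone(); }
    public int size() { return values.length; }
    public double get(int index) { return values[index]; }
}

// Hypothetical map-backed implementation; absent entries read as 0.0.
final class SparseVec implements Vec {
    private final int size;
    private final Map<Integer, Double> entries = new TreeMap<>();
    SparseVec(int size) { this.size = size; }
    void set(int index, double value) { entries.put(index, value); }
    public int size() { return size; }
    public double get(int index) { return entries.getOrDefault(index, 0.0); }
}

// One wrapper for every Vec implementation: callers only ever see Vec.
final class VecWritable {
    private static final byte DENSE = 0;
    private static final byte SPARSE = 1;
    private Vec vector;

    VecWritable() { }
    VecWritable(Vec vector) { this.vector = vector; }
    Vec get() { return vector; }
    void set(Vec vector) { this.vector = vector; }

    // Encoding is chosen by density, not by the concrete class.
    // Scanning every index keeps the sketch short; a real version
    // would iterate only the non-default elements.
    public void write(DataOutput out) throws IOException {
        int size = vector.size();
        int nonZero = 0;
        for (int i = 0; i < size; i++) {
            if (vector.get(i) != 0.0) nonZero++;
        }
        out.writeInt(size);
        if (nonZero > size / 2) {
            out.writeByte(DENSE);
            for (int i = 0; i < size; i++) out.writeDouble(vector.get(i));
        } else {
            out.writeByte(SPARSE);
            out.writeInt(nonZero);
            for (int i = 0; i < size; i++) {
                double v = vector.get(i);
                if (v != 0.0) { out.writeInt(i); out.writeDouble(v); }
            }
        }
    }

    // The "big switch": the tag decides which class to instantiate.
    public void readFields(DataInput in) throws IOException {
        int size = in.readInt();
        byte tag = in.readByte();
        switch (tag) {
            case DENSE: {
                double[] values = new double[size];
                for (int i = 0; i < size; i++) values[i] = in.readDouble();
                vector = new DenseVec(values);
                break;
            }
            case SPARSE: {
                SparseVec sparse = new SparseVec(size);
                int count = in.readInt();
                for (int i = 0; i < count; i++) sparse.set(in.readInt(), in.readDouble());
                vector = sparse;
                break;
            }
            default:
                throw new IOException("Unknown vector tag: " + tag);
        }
    }
}

// Helper that serializes a Vec to bytes and reads it back.
final class VecWritableDemo {
    static Vec roundTrip(Vec input) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new VecWritable(input).write(new DataOutputStream(bytes));
            VecWritable result = new VecWritable();
            result.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return result.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape a mapper or reducer would only handle Vec values: VecWritableDemo.roundTrip(someVec) hands back a Vec whose concrete class depends on how full the data was, not on the class that was written. The cost is that every new storage format needs a new tag and a new branch in readFields.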