On Tue, Jan 5, 2010 at 1:36 AM, Sean Owen <sro...@gmail.com> wrote:

> On Tue, Jan 5, 2010 at 9:18 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> > From what I can tell, in SequenceFile.Writer#append(Object key, Object
> > value) (why on earth is it taking Objects? shouldn't these be Writables?),
>
> There's also a version that takes Writables. Why, I don't know, but
> assume you're triggering the other one.
>
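For the record, the write path I'm exercising is nothing fancy - roughly the
following (Hadoop 0.20-era API; the Mahout package and vector class names are
from memory, so treat this as an illustrative sketch rather than exact code):

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.SparseVector;

  public class VectorWriteDemo {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // valueClass is declared as SparseVector.class when the file is created...
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, new Path("vectors.seq"), IntWritable.class, SparseVector.class);
      // ...and append() checks value.getClass() == valueClass exactly, so handing
      // it a DenseVector (or anything else that's merely instanceof Vector) fails
      // with an IOException complaining about the wrong value class.
      writer.append(new IntWritable(0), new DenseVector(new double[] {1.0, 2.0, 3.0}));
      writer.close();
    }
  }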
Yeah, I saw that after I wrote it.  I realized why: because Serializer can
take non-writable objects and write them if they are implemented to do so.

> > it does an explicit check of key.getClass() == this.keyClass and
> > value.getClass() == this.valueClass, which won't do any subclass matching
> > (and so will fail if value.getClass() is DenseVector.class, and valueClass
> > is SparseVector.class, or just Vector.class).
>
> Yes, I've hit this too. You can say the value or key class is an
> interface for this reason. It has to be the very class in use. I can
> imagine reasons for this.
>

How does this work?  The very class in use meaning?  If you make a
SequenceFile<IntWritable, Vector>, with the valueClass == Vector.class, you
can never pass in something whose runtime class is just Vector, because it's
non-instantiable.  You can pass in something which is instanceof Vector, but
getClass() != Vector.class.  Or am I confused?

> > To avoid this kind of mess, it seems the proper approach in MAHOUT-205
> > would be to have one overall VectorWritable class, which can
> > serialize/deserialize all Vector implementations. Right? This is how I've
>
> Yes this is what I imagine.
>

Ok, good.  That's how I've been working.

> > is a pain - you need to either move all the write(DataOutput) and
> > readFields(DataInput) methods from the vector implementations into the new
> > VectorWritable, and have a big switch statement deciding which one to call,
>
> I would imagine the serialized form of a vector is the same for
> SparseVector, DenseVector, etc. There's no question of representation.
> You write out all the non-default elements.
>

This will be twice as large in the dense case (there's no need to write out
indices).  Ok, not twice as large, but size() * (4 + 8) instead of size() * 8.
That's a pretty significant cost in terms of disk space and IO time.
Similarly, while the written form can be the same for different Sparse impls,
one will be writing the index/value pairs in order, the other will not, and
this will affect what needs to be done on reading in...

> Reading in, yes there is some element of choice, and your heuristic is
> fine. VectorWritable creates a Vector which can be obtained by the
> caller, and could be sparse or dense.
>
> Is your point that this won't do for some reason?
>

Oh, it'll work - I'm just seeing that if we get more Vector implementations,
keeping serialization separate from the math stuff means a proliferation of
classes (FooVectorWritable...) and an ever expanding switch statement in the
VectorWritable, which could get fragile, and I wondered whether there was a
"best practice" way this should be done in Hadoop, so you can have Writables
which actually live in a useful hierarchy, decorating some helpful
information on top of classes which do other things.

Mahout trunk treats Writable like it was Serializable (or more precisely:
Externalizable), which is great and object-oriented and nice.  Except that
Hadoop totally breaks proper OOP and doesn't let you do that right.  I was
just hoping that I was misunderstanding how Hadoop works in some way.  At
this point I don't think I was, unfortunately.  This should work fine, it's
just not the way I'd do it if I were designing it myself.

  -jake
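P.S.  To make the "ever expanding switch statement" concrete, the shape I have
in mind is roughly the following - just a sketch with two impls wired in and
made-up type constants, not actual MAHOUT-205 code, and the Vector API calls
are from memory:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.Writable;
  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.SparseVector;
  import org.apache.mahout.math.Vector;

  /** One Writable for all Vector impls: a type byte up front, then a
   *  per-implementation branch on both the write and the read side. */
  public class VectorWritable implements Writable {

    private static final byte DENSE = 0;
    private static final byte SPARSE = 1;

    private Vector vector;

    public VectorWritable() { }

    public VectorWritable(Vector vector) { this.vector = vector; }

    public Vector get() { return vector; }

    public void write(DataOutput out) throws IOException {
      if (vector instanceof DenseVector) {
        out.writeByte(DENSE);
        out.writeInt(vector.size());
        for (int i = 0; i < vector.size(); i++) {
          out.writeDouble(vector.get(i));          // values only: 8 bytes each
        }
      } else {                                     // everything else treated as sparse
        out.writeByte(SPARSE);
        out.writeInt(vector.size());
        out.writeInt(vector.getNumNondefaultElements());
        Iterator<Vector.Element> it = vector.iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          out.writeInt(e.index());                 // index/value pairs: 4 + 8 bytes each
          out.writeDouble(e.get());
        }
      }
    }

    public void readFields(DataInput in) throws IOException {
      byte type = in.readByte();
      int size = in.readInt();
      switch (type) {                              // this is the switch that grows
        case DENSE: {                              // with every new Vector impl
          vector = new DenseVector(size);
          for (int i = 0; i < size; i++) {
            vector.set(i, in.readDouble());
          }
          break;
        }
        case SPARSE: {
          vector = new SparseVector(size);
          int nonDefault = in.readInt();
          for (int i = 0; i < nonDefault; i++) {
            vector.set(in.readInt(), in.readDouble());
          }
          break;
        }
        default:
          throw new IOException("unknown vector type: " + type);
      }
    }
  }

Readable enough with two impls, but every new FooVector means another type
constant and another case in both methods, which is the fragility I mean.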