On Tue, Jan 5, 2010 at 4:38 PM, Jake Mannix <jake.man...@gmail.com> wrote:
> How does this work?  The very class in use meaning?  If you make a
> SequenceFile<IntWritable, Vector>, with the valueClass == Vector.class,
> you can never pass in something whose runtime class is just Vector, because
> it's non-instantiatable.  You can pass in something which is instanceof
> Vector,
> but getClass() != Vector.class.  Or am I confused?

Er, I may be speaking about something different that I thought was the
same thing. In a Reducer, for example, you can't declare the output
value type as "Vector" -- it has to be "SparseVector" or another
concrete implementation.

The restriction isn't due to the generic types or anything, but a
result of runtime class checks in the Hadoop code like the one sketched
below.

It's not possible for an object's runtime class to be "just Vector",
since it's an interface, so yes, the check could never pass.
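
To illustrate -- this is a rough sketch of the kind of check I mean,
not a verbatim copy of the Hadoop source -- SequenceFile's writer
compares the value's runtime class against the declared valueClass, so
an interface can never match:

  // Roughly what SequenceFile.Writer does on append(); illustrative only.
  if (val.getClass() != valClass) {
    throw new IOException("wrong value class: " + val.getClass().getName()
        + " is not " + valClass);
  }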


> Similarly, while write out can be the same for different Sparse impls, one
> will be writing the index/value pairs in order, the other will not, and
> this
> will affect what needs to be done on reading in...

You can always serialize the type of the vector being written first,
or something along those lines. This is what Java serialization does
too.
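
Something like this hypothetical wrapper, say (the class name is made
up, and it assumes each concrete Vector also implements Writable and
has a no-arg constructor):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  import org.apache.hadoop.io.Writable;
  import org.apache.mahout.math.Vector; // or wherever Vector lives in trunk

  public class TypedVectorWritable implements Writable {

    private Vector vector;

    public Vector get() {
      return vector;
    }

    public void set(Vector vector) {
      this.vector = vector;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      // Record the concrete class first, then let it write itself.
      out.writeUTF(vector.getClass().getName());
      ((Writable) vector).write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      String className = in.readUTF();
      try {
        vector = (Vector) Class.forName(className).newInstance();
      } catch (Exception e) {
        throw new IOException("Could not instantiate " + className, e);
      }
      ((Writable) vector).readFields(in);
    }
  }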

The fact that we're led back toward java.io.Serializable reinforces the
question I've always had about this aspect of Hadoop -- what was so
unusable about Java's existing serialization mechanism? It seems like a
needless reinvention, one that forgoes some of the nice aspects of
standard serialization.

... which further leads me to suggest that the *best* way of
approaching all this would be to implement Serializable correctly on
the vectors, then create one generic Writable wrapper that leverages
Serializable to do its work. Then we'd have everything implemented
nicely, and you could use Vectors in any context that relies on
standard serialization -- which is a big deal, for example, in J2EE.

I stand by that until someone points out why it won't work, or is slow
at runtime, or something; I don't see it yet.
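
Here's roughly what I have in mind -- a hypothetical generic wrapper
(the name is made up), with nothing vector-specific in it at all:

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import java.io.Serializable;

  import org.apache.hadoop.io.Writable;

  public class SerializableWritable<T extends Serializable> implements Writable {

    private T value;

    public T get() {
      return value;
    }

    public void set(T value) {
      this.value = value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      // Standard Java serialization into a byte array, length-prefixed.
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ObjectOutputStream oos = new ObjectOutputStream(bytes);
      oos.writeObject(value);
      oos.close();
      byte[] serialized = bytes.toByteArray();
      out.writeInt(serialized.length);
      out.write(serialized);
    }

    @SuppressWarnings("unchecked")
    @Override
    public void readFields(DataInput in) throws IOException {
      byte[] serialized = new byte[in.readInt()];
      in.readFully(serialized);
      ObjectInputStream ois =
          new ObjectInputStream(new ByteArrayInputStream(serialized));
      try {
        value = (T) ois.readObject();
      } catch (ClassNotFoundException cnfe) {
        throw new IOException(cnfe);
      } finally {
        ois.close();
      }
    }
  }

The obvious thing to measure is the per-record overhead of the Java
serialization stream headers -- presumably that's the "slow at runtime"
concern.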


> Oh, it'll work - I'm just seeing that if we get more Vector
> implementations,
> to keep serialization separate from the math stuff, it means a proliferation
> of classes (FooVectorWritable...) and an ever expanding switch statement
> in the VectorWritable, which could get fragile, and wondered whether there
> was a "best practice" way this should be done in Hadoop, so you can have
> Writables which actually live in a useful hierarchy decorating some helpful
> information on top of classes which do other things.  Mahout trunk treats
> Writable like it was Serializable (or more precisely: Externalizable),
> which
> is great and object-oriented and nice.  Except that Hadoop totally breaks
> proper OOP and doesn't let you do that right.

I think the world of vectors probably does break down into, at most,
sparse and dense representations, so maybe there are at most two
serialization routines to write. Not bad. I don't really see what's so
wrong about needing a serialization routine for every distinct
representation -- that would make logical sense, at least.
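
For concreteness, the two routines would look roughly like this
(hypothetical helpers, just to show the two shapes being discussed):

  import java.io.DataOutput;
  import java.io.IOException;

  final class VectorWriteShapes {

    // Dense: cardinality followed by every value, in order.
    static void writeDense(DataOutput out, double[] values) throws IOException {
      out.writeInt(values.length);
      for (double v : values) {
        out.writeDouble(v);
      }
    }

    // Sparse: cardinality, non-zero count, then index/value pairs.
    static void writeSparse(DataOutput out, int cardinality,
                            int[] indices, double[] values) throws IOException {
      out.writeInt(cardinality);
      out.writeInt(indices.length);
      for (int i = 0; i < indices.length; i++) {
        out.writeInt(indices[i]);
        out.writeDouble(values[i]);
      }
    }
  }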

What other representations are we anticipating anyhow?
