On Tue, Jan 5, 2010 at 1:36 AM, Sean Owen <sro...@gmail.com> wrote:

> On Tue, Jan 5, 2010 at 9:18 AM, Jake Mannix <jake.man...@gmail.com> wrote:
> >  From what I can tell, in SequenceFile.Writer#append(Object key, Object
> > value) (why on earth is it taking Objects?  shouldn't these be
> Writables?),
>
> There's also a version that takes Writables. Why, I don't know, but
> assume you're triggering the other one.
>

Yeah, I saw that after I wrote it.  I realized why: a Serializer can accept
non-Writable objects and write them, as long as a serialization for that
class has been registered.
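
(For reference, this is roughly how that non-Writable path gets enabled - a
sketch, assuming Hadoop's SerializationFactory machinery; the class names
below are the stock Hadoop serializations:)

  // append(Object, Object) looks up a Serializer for the runtime class via
  // the "io.serializations" setting.  WritableSerialization is the default;
  // adding JavaSerialization lets plain java.io.Serializable objects through.
  Configuration conf = new Configuration();
  conf.setStrings("io.serializations",
      "org.apache.hadoop.io.serializer.WritableSerialization",
      "org.apache.hadoop.io.serializer.JavaSerialization");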


> > it does an explicit check of key.getClass == this.keyClass and
> > value.getClass() == this.valueClass, which won't do any subclass matching
> > (and so will fail if value.getClass() is DenseVector.class, and
> valueClass
> > is SparseVector.class, or just Vector.class).
>
> Yes, I've hit this too. You can say the value or key class is an
> interface for this reason. It has to be the very class in use. I can
> imagine reasons for this.
>

How does this work?  What do you mean by "the very class in use"?  If you
make a SequenceFile<IntWritable, Vector> with valueClass == Vector.class,
you can never pass in something whose runtime class is Vector itself,
because an interface isn't instantiable.  You can pass in something which is
instanceof Vector, but its getClass() != Vector.class.  Or am I confused?
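
Concretely, the failure I'm describing looks like this - a minimal sketch,
assuming the createWriter(fs, conf, path, keyClass, valueClass) API and that
DenseVector implements Writable as on trunk; the path is just illustrative:

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path path = new Path("/tmp/vectors.seq");

  // Declare the value class to be the Vector interface...
  SequenceFile.Writer writer = SequenceFile.createWriter(
      fs, conf, path, IntWritable.class, Vector.class);

  // ...but append() tests value.getClass() == valueClass exactly, so this
  // throws IOException ("wrong value class"): DenseVector.class is not
  // Vector.class, and nothing can ever have getClass() == Vector.class,
  // since Vector is an interface.
  writer.append(new IntWritable(0), new DenseVector(new double[] {1.0, 2.0}));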


> >  To avoid this kind of mess, it seems the proper approach in MAHOUT-205
> > would be to have one overall VectorWritable class, which can
> > serialize/deserialize all Vector implementations.  Right?  This is how
> I've
>
> Yes this is what I imagine.
>

Ok, good. That's how I've been working.
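
In sketch form, the usage I'm aiming for is something like this
(VectorWritable here is the proposed MAHOUT-205 wrapper; the get()/set(Vector)
methods are my assumption, not an existing API):

  // One Writable for all Vector implementations: the declared value class is
  // always VectorWritable, so the exact-class check is satisfied whether the
  // wrapped vector is dense or sparse.
  SequenceFile.Writer writer = SequenceFile.createWriter(
      fs, conf, path, IntWritable.class, VectorWritable.class);

  VectorWritable vw = new VectorWritable();

  vw.set(new DenseVector(new double[] {1.0, 2.0}));
  writer.append(new IntWritable(0), vw);

  vw.set(new SparseVector(100));   // cardinality 100, values filled elsewhere
  writer.append(new IntWritable(1), vw);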


> > is a pain - you need to either move all the write(DataOutput) and
> > readFields(DataInput) methods from the vector implementations into the
> new
> > VectorWritable, and have a big switch statement deciding which one to
> call,
>
> I would imagine the serialized form of a vector is the same for
> SparseVector, DenseVector, etc. There's no question of representation.
> You write out all the non-default elements.
>

This will be larger in the dense case, where there's no need to write out
indices at all: size() * (4 + 8) bytes instead of size() * 8, i.e. 50% more.
That's a pretty significant cost in terms of disk space and I/O time.

Similarly, while the written form can be the same for the different sparse
impls, one writes its index/value pairs in index order and the other does
not, and that affects what needs to be done on reading in...
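
To make that concrete, here is the sort of write()/readFields() I mean - just
a sketch of one possible VectorWritable, not what's in the patch; the
DENSE/SPARSE flags and the exact byte layout are made up:

  private static final byte DENSE = 0;    // hypothetical type flags
  private static final byte SPARSE = 1;

  private Vector vector;                  // the wrapped math object

  public void write(DataOutput out) throws IOException {
    out.writeInt(vector.size());
    if (vector instanceof DenseVector) {
      out.writeByte(DENSE);
      for (int i = 0; i < vector.size(); i++) {
        out.writeDouble(vector.get(i));   // size() * 8 bytes, no indices
      }
    } else {
      out.writeByte(SPARSE);
      int nonZero = 0;
      for (int i = 0; i < vector.size(); i++) {
        if (vector.get(i) != 0.0) {
          nonZero++;
        }
      }
      out.writeInt(nonZero);
      for (int i = 0; i < vector.size(); i++) {
        double v = vector.get(i);
        if (v != 0.0) {                   // index/value pairs, written in
          out.writeInt(i);                // index order: nonZero * (4 + 8)
          out.writeDouble(v);             // bytes
        }
      }
    }
  }

  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    byte type = in.readByte();            // the "switch statement"
    if (type == DENSE) {
      DenseVector dense = new DenseVector(size);
      for (int i = 0; i < size; i++) {
        dense.set(i, in.readDouble());
      }
      vector = dense;
    } else {
      SparseVector sparse = new SparseVector(size);
      int nonZero = in.readInt();
      for (int n = 0; n < nonZero; n++) {
        sparse.set(in.readInt(), in.readDouble());
      }
      vector = sparse;
    }
  }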


> Reading in, yes there is some element of choice, and your heuristic is
> fine. VectorWritable creates a Vector which can be obtained by the
> caller, and could be sparse or dense.
>
> Is your point that this won't do for some reason?
>

Oh, it'll work - I'm just seeing that if we get more Vector implementations,
keeping serialization separate from the math stuff means a proliferation of
classes (FooVectorWritable...) and an ever-expanding switch statement in
VectorWritable, which could get fragile.  I was wondering whether there is a
"best practice" way to do this in Hadoop, so you can have Writables which
actually live in a useful hierarchy, decorating helpful information on top
of classes which do other things.  Mahout trunk treats Writable as if it
were Serializable (or more precisely, Externalizable), which is great and
object-oriented and nice.  Except that Hadoop's exact-class check breaks
proper OOP and doesn't let you do that properly.

I was just hoping that I was misunderstanding how Hadoop works in some
way.  At this point I don't think I was, unfortunately.

This should work fine; it's just not the way I'd do it if I were designing
it myself.

  -jake
