Is this what is commonly referred to as zig-zag encoding? Avro uses the same
technique:
http://hadoop.apache.org/avro/docs/1.3.2/spec.html#binary_encoding
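For anyone following along, the zig-zag mapping itself is just a small bit
trick. Roughly this (a sketch for illustration, not code from Avro, protobuf,
or Mahout):

// Zig-zag maps signed ints to unsigned so small-magnitude negatives stay small:
//   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
static int zigZagEncode(int n) {
  return (n << 1) ^ (n >> 31);   // arithmetic shift replicates the sign bit
}

static int zigZagDecode(int n) {
  return (n >>> 1) ^ -(n & 1);
}

The result is then written with the usual variable-length scheme, so values
near zero take one byte regardless of sign.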

For sequential sparse vectors we could get an additional win by delta
encoding the indices. Storing each index as the difference from the previous
index would keep it to two bytes in many cases.
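Something like the following is what I have in mind -- just a sketch under the
assumption that the indices are already sorted, not actual Mahout code:

import java.io.DataOutput;
import java.io.IOException;

// Write a sparse vector's sorted indices as deltas, each delta as a varint.
// Because the indices are sorted, the deltas are small and most fit in
// one or two bytes.
static void writeIndices(int[] sortedIndices, DataOutput out) throws IOException {
  int prev = 0;
  for (int index : sortedIndices) {
    int delta = index - prev;   // non-negative since indices are sorted
    prev = index;
    // varint: 7 payload bits per byte, high bit set means "more bytes follow"
    while ((delta & ~0x7F) != 0) {
      out.writeByte((delta & 0x7F) | 0x80);
      delta >>>= 7;
    }
    out.writeByte(delta);
  }
}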

Regardless, vint encoding will produce significant space savings, and
Sean's right: it has also been my experience that space savings often trump
speed, simply because network and storage speed are usually the bottleneck.

Does anyone have any idea whether greater gains are to be found by finely
tuning the base encoding vs. relying on some form of SequenceFile block
compression? (Or do both approaches complement each other nicely?)

On Sun, May 2, 2010 at 12:33 PM, Sean Owen <sro...@gmail.com> wrote:

> That's the one! I actually didn't know this was how PBs did the
> variable length encoding but makes sense, it's about the most
> efficient thing I can imagine.
>
> Values up to 16,383 fit in two bytes, which is less than a 4-byte int and
> the 3 bytes or so it would take the other scheme. Could add up over
> thousands of elements times millions of vectors.
>
> Decoding isn't too slow and if one believes this isn't an unusual
> encoding, it's not so problematic to use it in a format that others
> outside Mahout may wish to consume.
>
> On Sun, May 2, 2010 at 5:23 PM, Robin Anil <robin.a...@gmail.com> wrote:
> > You mean this type of encoding instead?
> >  http://code.google.com/apis/protocolbuffers/docs/encoding.html
>
