Hi,
I wonder whether it would be a good idea to replace the (rather wasteful)
fixed-width length fields, mostly 4-byte ints, with Lucene's variable-byte
int encoding in all places where size matters. We could "borrow" the code
from Lucene and create a VIntWritable for this purpose. I'm thinking
specifically about the following places:
* UTF8 (2-byte string length)
* ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)
* Properties and derived maps (like ContentProperties): all lengths are
written as 4-byte ints.
* any Writable that consists of lists of values currently serializes the
list size as a 4-byte int, e.g. ParseData.outlinks
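Overall I think the size savings could be considerable, at the cost of
some CPU: with the variable-byte encoding a length below 128 takes a
single byte and a length below 16384 takes two, while only values of
2^28 and above need five bytes, i.e. one more than a plain int.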
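To make the idea concrete, here is a minimal sketch of the encoding as I
understand it: 7 payload bits per byte, low-order group first, with the
high bit set on every byte except the last. The class name VIntSketch and
the encode/decode helpers are made up for illustration; this is not
Lucene's actual code and not a finished VIntWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration class; the name is an assumption, not an existing API.
final class VIntSketch {

    /** Writes value in 7-bit groups, low-order first; the high bit marks "more bytes follow". */
    static void writeVInt(DataOutput out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so negative ints terminate after 5 bytes
        }
        out.writeByte(value);
    }

    /** Reads an int written by writeVInt. */
    static int readVInt(DataInput in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Convenience helper: encodes a single int into a fresh byte array. */
    static byte[] encode(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writeVInt(new DataOutputStream(bytes), value);
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience helper: decodes a single int from the start of a byte array. */
    static int decode(byte[] encoded) {
        try {
            return readVInt(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this sketch, VIntSketch.encode(100).length is 1 and
VIntSketch.encode(300).length is 2, whereas DataOutput.writeInt always
emits 4 bytes. The extra CPU is the loop and the per-byte masking on each
read and write.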
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com