Andrzej Bialecki wrote:
> I wonder, would it be a good idea to replace the (rather wasteful)
> 4-byte ints with Lucene's variable-byte int encoding, in all places
> where size matters?
I'm not sure there are that many places where it could make a big
difference.
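
For reference, here is a minimal sketch of the variable-byte scheme being discussed: seven payload bits per byte, low-order bits first, with the high bit set when more bytes follow. The class and method names (VIntSketch, encode, decode) are hypothetical and this is not Lucene's or Nutch's actual code, only an illustration of the format.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of a variable-byte int; not Lucene's source.
class VIntSketch {
    // Encode an int using 7 bits per byte, least significant group first.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte has the high bit clear
        return out.toByteArray();
    }

    // Decode the variable-byte int stored at the start of data.
    static int decode(byte[] data) {
        int pos = 0;
        byte b = data[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384, Integer.MAX_VALUE};
        for (int v : samples) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

Small values take one byte, but values needing more than 28 bits take five, one more than a fixed 4-byte int.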
> * UTF8 (2-byte string length)
Currently Nutch uses Java's DataOutput format for UTF8, so this would
mean departing from that format, which is not a bad thing. But most
strings in Nutch (urls, anchors, etc.) are significantly longer than 4
bytes, so this won't provide a huge savings.
> * ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)
Are there particular space-sensitive usages of these?
> Overall I think the size savings could be considerable, at the cost of
> some CPU.
I'd be interested to see what the size savings really amount to.
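
One rough way to estimate this is to compare a fixed-width length prefix with a variable-byte one for a few payload sizes. The sketch below assumes the only change is the length prefix (2 bytes for UTF8, 4 bytes for the array types); the payload sizes in main are made up for illustration and are not measurements from Nutch data. The names LengthPrefixSavings, vintSize and savedFraction are hypothetical.

```java
// Hypothetical back-of-the-envelope estimate; payload sizes are invented.
class LengthPrefixSavings {
    // Number of bytes a non-negative int occupies at 7 payload bits per byte.
    static int vintSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    // Fraction of one record saved by replacing a fixed-width length
    // prefix with a variable-byte length.
    static double savedFraction(int fixedPrefixBytes, int payloadBytes) {
        int before = fixedPrefixBytes + payloadBytes;
        int after = vintSize(payloadBytes) + payloadBytes;
        return (before - after) / (double) before;
    }

    public static void main(String[] args) {
        int[] payloads = {8, 60, 200, 4096}; // assumed sizes in bytes
        for (int n : payloads) {
            System.out.printf("%5d bytes: 2-byte prefix saves %.1f%%, 4-byte prefix saves %.1f%%%n",
                    n, 100 * savedFraction(2, n), 100 * savedFraction(4, n));
        }
    }
}
```

Under these assumptions the saving is at most one byte per string with a 2-byte prefix and at most three bytes per value with a 4-byte prefix, so it only matters when the payloads themselves are very short.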
A more substantial savings might be had if we developed a version of
MapFile which writes keys as differences from the previous key. That
could make, e.g., all of the url-keyed files smaller.
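
To illustrate the idea of writing keys as differences, here is a minimal sketch that stores each sorted key as the number of leading characters it shares with the previous key plus the remaining suffix. This is only an illustration and not an existing MapFile feature: the names PrefixDeltaKeys and Entry are hypothetical, it works on Java chars rather than encoded bytes, and the example URLs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of prefix-delta key coding; not part of MapFile.
class PrefixDeltaKeys {
    // One stored key: how much of the previous key to reuse, plus the rest.
    static final class Entry {
        final int shared;
        final String suffix;

        Entry(int shared, String suffix) {
            this.shared = shared;
            this.suffix = suffix;
        }
    }

    // Keys must already be sorted so that neighbours share long prefixes.
    static List<Entry> encode(List<String> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int max = Math.min(prev.length(), key.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, key.substring(shared)));
            prev = key;
        }
        return out;
    }

    // Rebuild the full keys by replaying the entries in order.
    static List<String> decode(List<Entry> entries) {
        List<String> keys = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String key = prev.substring(0, e.shared) + e.suffix;
            keys.add(key);
            prev = key;
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "http://example.com/",
                "http://example.com/a.html",
                "http://example.com/about.html",
                "http://example.org/");
        for (Entry e : encode(keys)) {
            System.out.println(e.shared + " + \"" + e.suffix + "\"");
        }
    }
}
```

Because each key can only be rebuilt from the one before it, a real implementation would presumably need to write a full key at regular intervals (for instance at each index entry) so that random lookups can still start in the middle of the file.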
Another good way to save space would be to use a faster compression
algorithm in SequenceFile. The LZO algorithm is many times faster than
the gzip we use now.
Doug