Andrzej Bialecki wrote:
I wonder, would it be a good idea to replace the (rather wasteful) 4-byte ints with Lucene's variable-byte int encoding, in all places where size matters?

I'm not sure there are that many places where it could make a big difference.

* UTF8 (2-byte string length)

Currently Nutch uses Java's DataOutput format for UTF8, so this would mean departing from that format, which is not a bad thing. But most strings in Nutch (URLs, anchors, etc.) are significantly longer than 4 bytes, so trimming a byte from the length prefix won't provide huge savings.

* ArrayWritable/BytesWritable/TwoDArrayWritable (4-byte length)

Are there particular space-sensitive usages of these?

Overall I think the size savings could be considerable, at the cost of some CPU.

I'd be interested to see what the size savings really amount to.
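
To get a rough feel for the numbers, here is a minimal sketch of a variable-byte int encoding in the style Lucene uses: 7 payload bits per byte, with the high bit set when more bytes follow. The class and method names are made up for illustration; this is not the Nutch or Lucene API, just the idea.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, for illustration only (not a Nutch or Lucene class).
final class VIntSketch {

    // Encode an int using 7 bits per byte, low-order bits first.
    // The high bit of each byte is set when another byte follows.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode a single value from the start of the given bytes.
    static int decode(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
            shift += 7;
        }
        return result;
    }

    // Number of bytes the encoding needs for a given value.
    static int encodedSize(int value) {
        return encode(value).length;
    }
}
```

With this scheme, values below 128 take one byte and values below 16384 take two, versus four for a fixed-width int. Values of 2^28 and above (and negative values) take five bytes, so the gain depends entirely on how small the stored lengths and counts typically are.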

More substantial savings might be had if we developed a version of MapFile which writes keys as differences from the previous key. That could make, e.g., all of the url-keyed files smaller.
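
A minimal sketch of the key-difference idea, assuming keys arrive in sorted order (as they do in a MapFile): each key is stored as the number of leading characters it shares with the previous key, plus the remaining suffix. The names below are hypothetical and nothing like this exists in MapFile today; it also works on Java chars rather than encoded bytes, to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of prefix-compressed keys (not an existing Nutch class).
final class PrefixKeySketch {

    // One stored key: length of the prefix shared with the previous key, plus the rest.
    static final class Entry {
        final int shared;
        final String suffix;

        Entry(int shared, String suffix) {
            this.shared = shared;
            this.suffix = suffix;
        }
    }

    // Encode keys (assumed sorted) as differences from the previous key.
    static List<Entry> encode(List<String> sortedKeys) {
        List<Entry> entries = new ArrayList<Entry>();
        String previous = "";
        for (String key : sortedKeys) {
            int max = Math.min(previous.length(), key.length());
            int shared = 0;
            while (shared < max && previous.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            entries.add(new Entry(shared, key.substring(shared)));
            previous = key;
        }
        return entries;
    }

    // Rebuild the full keys by replaying the entries in order.
    static List<String> decode(List<Entry> entries) {
        List<String> keys = new ArrayList<String>();
        String previous = "";
        for (Entry e : entries) {
            String key = previous.substring(0, e.shared) + e.suffix;
            keys.add(key);
            previous = key;
        }
        return keys;
    }
}
```

Since sorted URLs from the same host share long prefixes, most entries would reduce to a small count and a short suffix. The catch is that a key can no longer be read in isolation, so random access would need full keys written at regular intervals, for example at each index entry.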

Another good way to save space would be to use a faster compression algorithm in SequenceFile. The LZO algorithm is many times faster than the gzip we use now.

Doug
