I have a bunch of Map/Reduce jobs that process documents and write the
results out to a few MapFiles. These MapFiles are subsequently searched in
an interactive application.

One problem I'm running into is that if the values in the MapFile data file
are fairly large, lookup can be slow. This is because the MapFile index
only stores every 128th key by default (io.map.index.interval), so after
the binary search the reader may have to scan/skip past up to 127 large
values on disk before it finds the matching record. I've tried
io.map.index.interval = 1, which brings average get() times down from 1200ms
to 200ms, but at the cost of the reader holding every key in memory at
runtime, which is undesirable.
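For reference, here's roughly how I'm setting the interval when the MapFiles
are written (just a minimal sketch; the path and key/value classes are
placeholders, not my actual job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class DenseIndexWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Index every key instead of every 128th; this is what makes get()
        // fast, but the reader loads the whole index on open, hence the
        // memory cost mentioned above.
        conf.setInt("io.map.index.interval", 1);

        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, "/tmp/example.map",
                               Text.class, Text.class);
        writer.append(new Text("doc-00001"),
                      new Text("...large document value..."));
        writer.close();
      }
    }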

One possible solution is to have the MapFile index store every single <key,
offset> pair. Then MapFile.Reader, upon startup, would read only every 128th
index entry into memory. MapFile.Reader.get() would behave the same way,
except instead of scanning through the data SequenceFile it would scan
through the index SequenceFile until it finds the matching key, and then it
can seek directly to the corresponding offset in the data file. I'm going
off the assumption that it's much faster to scan through the index (small
keys and offsets) than it is to scan through the values (large values).
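To make the idea concrete, here's a rough sketch of what the lookup half
might look like, assuming the binary search over the in-memory sparse index
(every 128th index entry) has already produced a starting position
(indexStart) in the dense index file. The class, method, and parameter names
are made up for illustration, not part of the MapFile API:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class DenseIndexLookup {

      /**
       * Scan the (small) dense index forward from indexStart until we find
       * `key`, then do a single seek into the (large) data file for the value.
       *
       * indexReader : reader over a dense index of <Text key, LongWritable offset>
       * dataReader  : reader over the existing data file of <Text key, large value>
       * indexStart  : position returned by the binary search over the sparse
       *               in-memory index
       */
      public static boolean get(SequenceFile.Reader indexReader,
                                SequenceFile.Reader dataReader,
                                long indexStart,
                                Text key,
                                Writable value) throws IOException {
        indexReader.seek(indexStart);
        Text indexKey = new Text();
        LongWritable dataOffset = new LongWritable();

        // Skipping through the index only touches small keys and longs, which
        // should be far cheaper than skipping over the large values on disk.
        while (indexReader.next(indexKey, dataOffset)) {
          int cmp = indexKey.compareTo(key);
          if (cmp == 0) {
            dataReader.seek(dataOffset.get());   // one seek into the big file
            return dataReader.next(new Text(), value);
          }
          if (cmp > 0) {
            return false;  // passed where the key would be; not present
          }
        }
        return false;      // ran off the end of the index
      }
    }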

Or maybe the index could be some kind of disk-based B-tree or BDB-like
implementation?

Anybody encounter this problem before?

Andy
