Not sure if it's still there, but there was a parameter in the hadoop-site conf
file that would let you skip every x number of index entries when reading the
index into memory.
From what I understand, we find the key offset just before the data, seek
once, and read until we find the key.
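If that parameter is still around, I believe the key is "io.map.index.skip"
(that name, the skip semantics, and the path/types below are my guesses, so
treat this as a rough sketch of the read side):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SkipIndexRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep only a fraction of the index entries in memory; if I remember
    // right, a skip of 7 keeps roughly every 8th entry.
    conf.setInt("io.map.index.skip", 7);

    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/docs.map", conf);

    // get() binary-searches the in-memory index, seeks once into the data
    // file, then reads forward until it reaches (or passes) the key.
    BytesWritable value = new BytesWritable();
    if (reader.get(new Text("doc-00001"), value) != null) {
      System.out.println("found " + value.getLength() + " bytes");
    }
    reader.close();
  }
}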

Billy


----- Original Message ----- From: "Andy Liu" <[email protected]>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/[email protected]>
Sent: Tuesday, July 28, 2009 7:53 AM
Subject: MapFile performance


I have a bunch of Map/Reduce jobs that process documents and write the
results out to a few MapFiles. These MapFiles are subsequently searched in
an interactive application.

One problem I'm running into is that if the values in the MapFile data file
are fairly large, lookup can be slow.  This is because the MapFile index
only stores every 128th key by default (io.map.index.interval), and after
the binary search it may have to scan/skip through up to 127 values (read off
disk) before it finds the matching record.  I've tried io.map.index.interval
= 1, which brings average get() times down from 1200ms to 200ms, but at the
cost of extra memory at runtime, which is undesirable.
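Roughly, that's setting the interval at write time (io.map.index.interval has
to be in effect when the MapFile.Writer is created; the path and key/value
types below are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class WriteDenseIndex {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Index every key: much faster get(), but the whole index is then held
    // in memory by every reader.
    conf.setInt("io.map.index.interval", 1);

    FileSystem fs = FileSystem.get(conf);
    MapFile.Writer writer = new MapFile.Writer(conf, fs, "/data/docs.map",
        Text.class, BytesWritable.class);

    // Keys must be appended in sorted order.
    byte[] payload = "large document body...".getBytes();
    writer.append(new Text("doc-00001"), new BytesWritable(payload));
    writer.close();
  }
}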

One possible solution is to have the MapFile index store every single
<key, offset> pair.  Then MapFile.Reader, upon startup, would read every
128th key into memory.  MapFile.Reader.get() would behave the same way,
except instead of scanning through the values SequenceFile it would scan
through the index SequenceFile until it finds the matching record, and then
it could seek to the corresponding offset in the values.  I'm going off the
assumption that it's much faster to scan through the index (small keys) than
it is to scan through the values (large values).
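To make the idea concrete, here's a rough, purely hypothetical sketch of that
lookup path; none of this is existing MapFile code, and the class and field
names are made up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;

/** Hypothetical: a dense on-disk index of <key, offset> pairs, with every
 *  128th index key sampled into memory. */
public class DenseIndexLookup {
  private final SequenceFile.Reader denseIndex;    // all <key, offset> pairs
  private final WritableComparable[] sampleKeys;   // every 128th index key
  private final long[] samplePositions;            // their positions in the index file
  private final WritableComparable scratchKey;     // reused key instance

  public DenseIndexLookup(SequenceFile.Reader denseIndex,
                          WritableComparable[] sampleKeys,
                          long[] samplePositions,
                          WritableComparable scratchKey) {
    this.denseIndex = denseIndex;
    this.sampleKeys = sampleKeys;
    this.samplePositions = samplePositions;
    this.scratchKey = scratchKey;
  }

  /** Returns the data-file offset for 'key', or -1 if the key is absent. */
  public long findDataOffset(WritableComparable key) throws IOException {
    // Binary search the in-memory sample for a starting point in the index file.
    int lo = 0, hi = sampleKeys.length - 1, start = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (sampleKeys[mid].compareTo(key) <= 0) { start = mid; lo = mid + 1; }
      else { hi = mid - 1; }
    }
    denseIndex.seek(samplePositions[start]);

    // Scan forward through small <key, offset> records instead of large values.
    LongWritable dataOffset = new LongWritable();
    while (denseIndex.next(scratchKey, dataOffset)) {
      int cmp = scratchKey.compareTo(key);
      if (cmp == 0) return dataOffset.get();  // then: one seek into the data file
      if (cmp > 0) break;                     // passed it, so the key isn't there
    }
    return -1;
  }
}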

Or maybe the index could be some kind of disk-based B-tree or BDB-like
implementation?

Anybody encounter this problem before?

Andy


