Not sure if it's still there, but there was a parameter in the hadoop-site conf
file that would let you skip every x index entries when reading the index into
memory.
From what I understand, the reader finds the indexed key/offset just before the
target, seeks once, and then reads forward until it finds the key.
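If it's the parameter I'm thinking of, using it looks roughly like this (sketch
from memory -- double-check the io.map.index.skip name and the reader
constructor against your Hadoop version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SparseIndexLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Skip 7 index entries between each one kept, i.e. hold only every
    // 8th entry of the index file in memory (property name from memory).
    conf.setInt("io.map.index.skip", 7);

    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/path/to/mapfile", conf);
    Text value = new Text();
    // binary search of the (sparser) in-memory index, one seek into the
    // data file, then a forward read until the key turns up
    reader.get(new Text("some-key"), value);
    reader.close();
  }
}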
Billy
----- Original Message -----
From: "Andy Liu" <[email protected]>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/[email protected]>
Sent: Tuesday, July 28, 2009 7:53 AM
Subject: MapFile performance
I have a bunch of Map/Reduce jobs that process documents and write the results
out to a few MapFiles. These MapFiles are subsequently searched in an
interactive application.
One problem I'm running into is that if the values in the MapFile data file are
fairly large, lookup can be slow. This is because the MapFile index only stores
every 128th key by default (io.map.index.interval), so after the binary search
it may have to scan or skip through up to 127 values on disk before it finds
the matching record. I've tried io.map.index.interval = 1, which brings average
get() times from 1200ms down to 200ms, but at the cost of memory at runtime,
which is undesirable.
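For reference, here's roughly what my write/read path looks like (sketch only,
with placeholder paths and key/value types; as far as I can tell the writer
picks the interval up from the conf):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookupDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Only every Nth key goes into the index file (default 128).
    conf.setInt("io.map.index.interval", 128);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, "/data/docs.map", Text.class, Text.class);
    writer.append(new Text("doc-00001"), new Text("...large document body..."));
    writer.close();

    // The reader loads the sparse index into memory, binary-searches it, then
    // seeks into the data file and may scan up to interval-1 large records.
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/docs.map", conf);
    Text value = new Text();
    long start = System.currentTimeMillis();
    reader.get(new Text("doc-00001"), value);
    System.out.println("get() took " + (System.currentTimeMillis() - start) + " ms");
    reader.close();
  }
}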
One possible solution is to have the MapFile index store every single <key,
offset> pair. Then MapFile.Reader, upon startup, would read every 128th index
entry into memory. MapFile.Reader.get() would behave the same way, except that
instead of seeking through the values SequenceFile it would seek through the
index SequenceFile until it finds the matching record, and then it can seek to
the corresponding offset in the values file. I'm going off the assumption that
it's much faster to scan through the index (small keys) than it is to scan
through the values (large values).
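Roughly what I have in mind, at the pseudocode level (not real MapFile
internals; the field and helper names here are made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sketch of the proposal. Assumes: the index SequenceFile now holds EVERY
// <key, data-file offset> pair, and every 128th index entry was sampled into
// sparseKeys/sparsePositions at startup (sparsePositions are byte offsets
// into the *index* file).
class DenseIndexLookup {
  WritableComparator comparator;
  WritableComparable[] sparseKeys;
  long[] sparsePositions;
  SequenceFile.Reader indexReader;   // <key, LongWritable offset into data file>
  SequenceFile.Reader dataReader;    // <key, large value> -- the existing data file

  public Writable get(WritableComparable key, Writable val) throws IOException {
    // 1. binary search the small in-memory sample of the index
    int i = floorIndex(key);
    if (i < 0) return null;

    // 2. seek into the dense index and scan forward over small keys only
    indexReader.seek(sparsePositions[i]);
    WritableComparable k = (WritableComparable) comparator.newKey();
    LongWritable dataOffset = new LongWritable();
    while (indexReader.next(k, dataOffset)) {
      int c = comparator.compare(k, key);
      if (c == 0) {
        // 3. one seek into the data file, one read of the (large) value
        dataReader.seek(dataOffset.get());
        dataReader.next(k, val);
        return val;
      }
      if (c > 0) break;   // passed it: key is not present
    }
    return null;
  }

  // largest sampled index entry whose key is <= the requested key, or -1
  private int floorIndex(WritableComparable key) {
    int lo = 0, hi = sparseKeys.length - 1, ans = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (comparator.compare(sparseKeys[mid], key) <= 0) { ans = mid; lo = mid + 1; }
      else { hi = mid - 1; }
    }
    return ans;
  }
}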
Or maybe the index could be some kind of disk-based B-tree or BDB-like
implementation?
Anybody encounter this problem before?
Andy