Not sure if it's still there, but there was a parameter in the hadoop-site conf
file that would let you skip every x index entries when reading the index into
memory.
From what I understand, the reader finds the indexed key/offset just before the
target, seeks once, and then reads forward until it finds the key.
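If it's the parameter I'm thinking of, using it looks roughly like this (sketch
from memory -- double-check the io.map.index.skip name and the reader
constructor against your Hadoop version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SparseIndexLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Skip 7 index entries between each one kept, i.e. hold only every
    // 8th entry of the index file in memory (property name from memory).
    conf.setInt("io.map.index.skip", 7);

    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/path/to/mapfile", conf);
    Text value = new Text();
    // binary search of the (sparser) in-memory index, one seek into the
    // data file, then a forward read until the key turns up
    reader.get(new Text("some-key"), value);
    reader.close();
  }
}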
Billy
----- Original Message -----
From: "Andy Liu" <[email protected]>
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: <core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/[email protected]>
Sent: Tuesday, July 28, 2009 7:53 AM
Subject: MapFile performance
I have a bunch of Map/Reduce jobs that process documents and write the results
out to a few MapFiles. These MapFiles are subsequently searched in an
interactive application.
One problem I'm running into is that if the values in the MapFile data file are
fairly large, lookup can be slow. This is because the MapFile index only stores
every 128th key by default (io.map.index.interval), so after the binary search
it may have to scan or skip through up to 127 values on disk before it finds
the matching record. I've tried io.map.index.interval = 1, which brings average
get() times from 1200ms down to 200ms, but at the cost of memory at runtime,
which is undesirable.
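For reference, here's roughly what my write/read path looks like (sketch only,
with placeholder paths and key/value types; as far as I can tell the writer
picks the interval up from the conf):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookupDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Only every Nth key goes into the index file (default 128).
    conf.setInt("io.map.index.interval", 128);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, "/data/docs.map", Text.class, Text.class);
    writer.append(new Text("doc-00001"), new Text("...large document body..."));
    writer.close();

    // The reader loads the sparse index into memory, binary-searches it, then
    // seeks into the data file and may scan up to interval-1 large records.
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/docs.map", conf);
    Text value = new Text();
    long start = System.currentTimeMillis();
    reader.get(new Text("doc-00001"), value);
    System.out.println("get() took " + (System.currentTimeMillis() - start) + " ms");
    reader.close();
  }
}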
One possible solution is to have the MapFile index store every single <key,
offset> pair. Then MapFile.Reader, upon startup, would read every 128th index
entry into memory. MapFile.Reader.get() would behave the same way, except that
instead of seeking through the values SequenceFile it would seek through the
index SequenceFile until it finds the matching record, and then it can seek to
the corresponding offset in the values file. I'm going off the assumption that
it's much faster to scan through the index (small keys) than it is to scan
through the values (large values).
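Roughly what I have in mind, at the pseudocode level (not real MapFile
internals; the field and helper names here are made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sketch of the proposal. Assumes: the index SequenceFile now holds EVERY
// <key, data-file offset> pair, and every 128th index entry was sampled into
// sparseKeys/sparsePositions at startup (sparsePositions are byte offsets
// into the *index* file).
class DenseIndexLookup {
  WritableComparator comparator;
  WritableComparable[] sparseKeys;
  long[] sparsePositions;
  SequenceFile.Reader indexReader;   // <key, LongWritable offset into data file>
  SequenceFile.Reader dataReader;    // <key, large value> -- the existing data file

  public Writable get(WritableComparable key, Writable val) throws IOException {
    // 1. binary search the small in-memory sample of the index
    int i = floorIndex(key);
    if (i < 0) return null;

    // 2. seek into the dense index and scan forward over small keys only
    indexReader.seek(sparsePositions[i]);
    WritableComparable k = (WritableComparable) comparator.newKey();
    LongWritable dataOffset = new LongWritable();
    while (indexReader.next(k, dataOffset)) {
      int c = comparator.compare(k, key);
      if (c == 0) {
        // 3. one seek into the data file, one read of the (large) value
        dataReader.seek(dataOffset.get());
        dataReader.next(k, val);
        return val;
      }
      if (c > 0) break;   // passed it: key is not present
    }
    return null;
  }

  // largest sampled index entry whose key is <= the requested key, or -1
  private int floorIndex(WritableComparable key) {
    int lo = 0, hi = sparseKeys.length - 1, ans = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (comparator.compare(sparseKeys[mid], key) <= 0) { ans = mid; lo = mid + 1; }
      else { hi = mid - 1; }
    }
    return ans;
  }
}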
Or maybe the index could be some kind of disk-based B-tree or BDB-like
implementation?
Anybody encounter this problem before?
Andy