I've wondered about the possibility of adding HDFS as a back-end to an existing key-value store, like EHCache or Sleepycat. There are several such projects with excellent engineering that address problems like this one, and there are advantages to incorporating them rather than rewriting that functionality.
Other than that, I'd think about techniques like multiple levels of index, and caching on the local disk drive - lock the top level into memory, and perhaps use something like memory-mapped files for additional levels.

dbr

On 7/28/2009 8:53 AM, Andy Liu wrote:
> I have a bunch of Map/Reduce jobs that process documents and write the
> results out to a few MapFiles. These MapFiles are subsequently searched in
> an interactive application.
>
> One problem I'm running into is that if the values in the MapFile data file
> are fairly large, lookup can be slow. This is because the MapFile index
> only stores every 128th key by default (io.map.index.interval), and after
> the binary search it may have to scan/skip through up to 127 values (off of
> disk) before it finds the matching record. I've tried io.map.index.interval
> = 1, which brings average get() times from 1200ms to 200ms, but at the cost
> of memory during runtime, which is undesirable.
>
> One possible solution is to have the MapFile index store every single <key,
> offset> pair. Then MapFile.Reader, upon startup, would read every 128th key
> into memory. MapFile.Reader.get() would behave the same way, except instead of
> seeking through the values SequenceFile it would seek through the index
> SequenceFile until it finds the matching record, and then it can seek to the
> corresponding offset in the values. I'm going off the assumption that it's
> much faster to scan through the index (small keys) than it is to scan
> through the values (large values).
>
> Or maybe the index can be some kind of disk-based btree or bdb-like
> implementation?
>
> Anybody encounter this problem before?
>
> Andy
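For reference, both settings discussed above exist in the 0.20-era API: io.map.index.interval (or MapFile.Writer.setIndexInterval) controls how often the writer adds a key to the index file, and io.map.index.skip tells MapFile.Reader to keep only a fraction of those entries in memory. The sketch below shows where each one plugs in; the path, key/value types, and the sample record are illustrative, not taken from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MapFileIndexTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/tmp/docs.map";                    // illustrative path

    // Write side: index every key instead of every 128th, i.e. the
    // io.map.index.interval = 1 experiment from the mail. Set it before
    // appending so all keys are covered.
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, Text.class, BytesWritable.class);
    writer.setIndexInterval(1);
    writer.append(new Text("doc-00001"),
                  new BytesWritable("large document payload".getBytes()));
    writer.close();

    // Read side: io.map.index.skip makes the reader keep only every
    // (skip + 1)-th index entry in memory. That wins back the RAM that
    // interval = 1 costs, but the reader then scans the data file again
    // after its binary search, so it trades memory against latency rather
    // than eliminating the scan.
    conf.setInt("io.map.index.skip", 127);
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    BytesWritable value = new BytesWritable();
    Writable found = reader.get(new Text("doc-00001"), value);
    System.out.println(found != null ? value : "not found");
    reader.close();
  }
}

In other words, the stock knobs only move the memory/latency trade-off around; getting the scan out of the data file entirely needs something along the lines of the proposal in the mail.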

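The proposal in the quoted mail (a dense index on disk, a sparse sample of it in memory, a scan over small index records, then a single seek into the data file) can be prototyped without patching MapFile at all, by building a flat side index from the data file and reading it with plain SequenceFile.Reader calls. The class below is a hypothetical sketch along those lines, not the poster's exact design: it assumes Text keys, BytesWritable values, and a data file that is not block-compressed (so getPosition() gives per-record offsets); the class and file names are made up for the example.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

/**
 * Hypothetical prototype of the lookup scheme from the mail: every key's
 * data-file offset lives in a small on-disk side index, only every
 * SAMPLE-th side-index entry is held in memory, and a get() scans small
 * index records (not large data records) before making one seek into the
 * data file. Nothing here is part of Hadoop.
 */
public class DenseSideIndexReader {

  private static final int SAMPLE = 128;            // in-memory sampling rate

  private final SequenceFile.Reader sideIndex;      // <key, offset into data file>
  private final SequenceFile.Reader data;           // the MapFile's data file
  private final List<Text> sampleKeys = new ArrayList<Text>();
  private final List<Long> samplePos = new ArrayList<Long>(); // positions in side index

  /**
   * One sequential pass over the MapFile's data file, writing an
   * uncompressed <key, offset> side index. Assumes the data file is not
   * block-compressed, so getPosition() before a record is its exact start.
   */
  public static void build(FileSystem fs, String mapDir, Path sideIndexFile,
                           Configuration conf) throws IOException {
    SequenceFile.Reader in =
        new SequenceFile.Reader(fs, new Path(mapDir, MapFile.DATA_FILE_NAME), conf);
    SequenceFile.Writer out = SequenceFile.createWriter(
        fs, conf, sideIndexFile, Text.class, LongWritable.class, CompressionType.NONE);
    Text key = new Text();
    BytesWritable value = new BytesWritable();
    long offset = in.getPosition();
    while (in.next(key, value)) {
      out.append(key, new LongWritable(offset));
      offset = in.getPosition();
    }
    out.close();
    in.close();
  }

  /** Opens the readers and samples every SAMPLE-th side-index entry. */
  public DenseSideIndexReader(FileSystem fs, String mapDir, Path sideIndexFile,
                              Configuration conf) throws IOException {
    sideIndex = new SequenceFile.Reader(fs, sideIndexFile, conf);
    data = new SequenceFile.Reader(fs, new Path(mapDir, MapFile.DATA_FILE_NAME), conf);
    Text key = new Text();
    LongWritable offset = new LongWritable();
    long n = 0;
    long pos = sideIndex.getPosition();              // position of the next record
    while (sideIndex.next(key, offset)) {
      if (n % SAMPLE == 0) {
        sampleKeys.add(new Text(key));               // copy: next() reuses 'key'
        samplePos.add(pos);
      }
      n++;
      pos = sideIndex.getPosition();
    }
  }

  /** Fills 'value' and returns true if 'wanted' is present. */
  public boolean get(Text wanted, BytesWritable value) throws IOException {
    // Binary search the in-memory sample for the last sampled key <= wanted.
    int lo = 0, hi = sampleKeys.size() - 1, start = -1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (sampleKeys.get(mid).compareTo(wanted) <= 0) { start = mid; lo = mid + 1; }
      else { hi = mid - 1; }
    }
    if (start < 0) return false;                     // smaller than the first key

    // Scan at most SAMPLE small index records instead of SAMPLE large values.
    sideIndex.seek(samplePos.get(start));
    Text key = new Text();
    LongWritable offset = new LongWritable();
    for (int i = 0; i < SAMPLE && sideIndex.next(key, offset); i++) {
      int cmp = key.compareTo(wanted);
      if (cmp == 0) {                                // exactly one seek into the data file
        data.seek(offset.get());
        return data.next(new Text(), value);
      }
      if (cmp > 0) break;                            // passed it; the key is absent
    }
    return false;
  }
}

Whether this wins in practice rests on the assumption stated in the mail: scanning up to 127 small <key, offset> records is much cheaper than scanning up to 127 large values. A real version would also need to close the readers and guard against concurrent get() calls, since SequenceFile.Reader is stateful.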