My benchmark was for 5M records, on a 64-bit Opteron with 16 GB of memory. The Lucene index was about 10 GB, and the Hadoop record store was a few GB smaller.
Performing random seeks, the results were:

  Hadoop records:                        16.84 ms per seek
  Lucene records w/ TermQuery:           37.17 ms per seek
  Lucene records by Lucene document ID:   0.11 ms per seek

All three benchmarks were performed under the same conditions. The two Lucene benchmarks were performed on separate days, so I don't think the buffer cache would have kept the index in memory, although I must admit that I'm quite ignorant of how Linux buffer caches really work.

Andy
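For context, the three access paths being compared might look roughly like the sketch below, written against Hadoop 0.12-era and Lucene 2.x-era APIs; the field name "id", the use of BytesWritable values, and the class name SeekPaths are illustrative assumptions, not details from the thread.

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SeekPaths {

      // 1. Hadoop MapFile: a binary search over the sorted key index,
      //    then a seek into the data file.
      static byte[] hadoopGet(MapFile.Reader records, String key) throws Exception {
        BytesWritable value = new BytesWritable();
        return records.get(new Text(key), value) == null ? null : value.get();
      }

      // 2. Lucene stored fields via TermQuery: a term lookup runs before
      //    the stored fields can be fetched.
      static Document luceneByKey(IndexSearcher searcher, String key) throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("id", key)));
        return hits.length() == 0 ? null : hits.doc(0);
      }

      // 3. Lucene stored fields by internal document ID: a direct offset
      //    lookup through the .fdx/.fdt files.
      static Document luceneById(IndexReader reader, int docId) throws Exception {
        return reader.document(docId);
      }
    }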
On 4/13/07, Doug Cutting <[EMAIL PROTECTED]> wrote:

How big was your benchmark? For micro-benchmarks, CPU time will dominate.
For random access to collections larger than memory, disk seeks should
dominate. If you're interested in the latter case, then you should
benchmark this: build a database substantially larger than the memory on
your machine, and access it randomly for a while.

Doug

Andy Liu wrote:
> I ran a quick benchmark between Hadoop MapFile and Lucene's stored
> fields. Using String keys, Hadoop was faster than Lucene, since in
> Lucene this requires a TermQuery before the document data can be
> accessed. However, using Lucene's internal IDs, pulling up the data is
> orders of magnitude faster than MapFile. Looking at the code, it makes
> sense why: MapFile uses a binary search on sorted keys to locate the
> data offsets, while Lucene's internal IDs simply point to an offset in
> an index file that points to the data offset in the .fdt file. I'm
> assuming that, in terms of accessing random records, it just doesn't
> get any faster than this.
>
> My application doesn't require any incremental updates, so I'm
> considering using Lucene's FSDirectory/IndexOutput/IndexInput to write
> out serialized records in a similar way to how Lucene handles stored
> fields. The only drawback is that I'll have to look up the records
> using the internal IDs. I'm looking at BDB as well, since there's no
> limitation on what type of keys I can use to look up the records.
> Thanks for your help.
>
> Andy
>
> On 4/12/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
>>
>> Andy Liu wrote:
>> > I'm exploring the possibility of using the Hadoop records framework
>> > to store these document records on disk. Here are my questions:
>> >
>> > 1. Is this a good application of the Hadoop records framework,
>> > keeping in mind that my goals are speed and scalability? I'm
>> > assuming the answer is yes, especially considering Nutch uses the
>> > same approach
>>
>> For read-only access, performance should be decent. However, Hadoop's
>> file structures do not permit incremental updates. Rather, they are
>> primarily designed for batch operations, like MapReduce outputs. If
>> you need to incrementally update your data, then you might look at
>> something like BDB, a relational DB, or perhaps experiment with HBase.
>> (HBase is designed to be a much more scalable, incrementally updatable
>> DB than BDB or relational DBs, but its implementation is not yet
>> complete.)
>>
>> Doug
>>
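For the record store Andy floats above (serialized records written through Lucene's FSDirectory/IndexOutput/IndexInput and addressed by internal ID), a minimal sketch against the Lucene 2.x-era Directory API might look like the following; the file names "records.idx" and "records.dat", the fixed 8-byte offset table, and the class name RecordStore are illustrative assumptions, not something from the thread.

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    public class RecordStore {

      // Write records sequentially: an index file of fixed-width offsets
      // plus a data file of length-prefixed byte blobs.
      public static void write(Directory dir, byte[][] records) throws Exception {
        IndexOutput idx = dir.createOutput("records.idx");
        IndexOutput dat = dir.createOutput("records.dat");
        for (byte[] rec : records) {
          idx.writeLong(dat.getFilePointer()); // offset of this record in the data file
          dat.writeVInt(rec.length);
          dat.writeBytes(rec, rec.length);
        }
        idx.close();
        dat.close();
      }

      // Random access by internal ID: one seek into the offset table,
      // then one seek into the data file.
      public static byte[] read(Directory dir, int id) throws Exception {
        IndexInput idx = dir.openInput("records.idx");
        IndexInput dat = dir.openInput("records.dat");
        try {
          idx.seek(8L * id);          // fixed-width long offsets
          dat.seek(idx.readLong());
          byte[] rec = new byte[dat.readVInt()];
          dat.readBytes(rec, 0, rec.length);
          return rec;
        } finally {
          idx.close();
          dat.close();
        }
      }
    }

The layout mirrors the point made in the thread: fetching record i costs a direct offset lookup plus one data-file seek, with no key comparison or binary search, which is essentially how Lucene's .fdx/.fdt stored-field files behave.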
