I ran a quick benchmark comparing Hadoop's MapFile with Lucene's stored fields. Using String keys, Hadoop was faster than Lucene, since in Lucene a TermQuery is needed to resolve the key before the document data can be accessed. However, using Lucene's internal IDs, pulling up the data is orders of magnitude faster than MapFile. Looking at the code, it makes sense why: MapFile does a binary search over sorted keys to locate the data offset, while a Lucene internal ID simply points to an offset in the index file (.fdx), which in turn points to the data offset in the .fdt file. For random record access, I'm assuming it just doesn't get any faster than that.
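To make the difference concrete, here's roughly what the two lookup paths look like. This is only a sketch against the Lucene 2.x API of the time; the index path, field name, and key value are placeholders:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class LookupComparison {
  public static void main(String[] args) throws Exception {
    String indexPath = "/path/to/index";  // placeholder

    // Lookup by String key: a TermQuery must first resolve the key to an
    // internal doc ID before the stored fields can be loaded.
    IndexSearcher searcher = new IndexSearcher(indexPath);
    Hits hits = searcher.search(new TermQuery(new Term("key", "record-42")));
    Document byKey = hits.doc(0);  // assumes the key exists in the index

    // Lookup by internal ID: a direct seek through .fdx into .fdt,
    // no term dictionary involved.
    IndexReader reader = IndexReader.open(indexPath);
    Document byId = reader.document(42);

    searcher.close();
    reader.close();
  }
}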
My application doesn't require any incremental updates, so I'm considering using Lucene's FSDirectory/IndexOutput/IndexInput to write out serialized records in much the same way Lucene handles stored fields. The only drawback is that I'll have to look up the records using the internal IDs. I'm also looking at BDB, since it places no limitation on what type of keys I can use to look up the records.
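Here's roughly what I have in mind for the IndexOutput/IndexInput approach. It's only a sketch: the file names, record format, and fixed 8-byte offset entries are made up, and the FSDirectory factory signature varies between Lucene versions:

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

public class RecordStore {

  // Batch write: records go to "records.dat", and an 8-byte offset per
  // record goes to "records.idx", mimicking Lucene's .fdt/.fdx layout.
  public static void write(Directory dir, String[] records) throws Exception {
    IndexOutput data = dir.createOutput("records.dat");
    IndexOutput index = dir.createOutput("records.idx");
    for (String record : records) {
      index.writeLong(data.getFilePointer());  // where this record starts
      data.writeString(record);                // serialized record body
    }
    data.close();
    index.close();
  }

  // Random read of record n: one seek into the offset table, one seek
  // into the data file.
  public static String read(Directory dir, int n) throws Exception {
    IndexInput index = dir.openInput("records.idx");
    IndexInput data = dir.openInput("records.dat");
    index.seek(8L * n);
    data.seek(index.readLong());
    String record = data.readString();
    index.close();
    data.close();
    return record;
  }

  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory("/path/to/store", true);
    write(dir, new String[] { "rec0", "rec1", "rec2" });
    System.out.println(read(dir, 1));  // prints "rec1"
    dir.close();
  }
}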
Thanks for your help.

Andy

On 4/12/07, Doug Cutting <[EMAIL PROTECTED]> wrote:

Andy Liu wrote:
> I'm exploring the possibility of using the Hadoop records framework to
> store these document records on disk. Here are my questions:
>
> 1. Is this a good application of the Hadoop records framework, keeping in
> mind that my goals are speed and scalability? I'm assuming the answer is
> yes, especially considering Nutch uses the same approach

For read-only access, performance should be decent. However, Hadoop's file structures do not permit incremental updates. Rather, they are primarily designed for batch operations, like MapReduce outputs.

If you need to incrementally update your data, then you might look at something like BDB, a relational DB, or perhaps experiment with HBase. (HBase is designed to be a much more scalable, incrementally updateable DB than BDB or relational DBs, but its implementation is not yet complete.)

Doug
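For comparison, the batch-oriented MapFile pattern discussed above looks roughly like this (a sketch only; the path and the Text key/value types are placeholders, and keys must be appended in sorted order):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    String dir = "/tmp/records.map";  // placeholder

    // Batch write: MapFile requires keys to arrive in sorted order, which is
    // why it suits MapReduce outputs rather than incremental updates.
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
    writer.append(new Text("key1"), new Text("value1"));
    writer.append(new Text("key2"), new Text("value2"));
    writer.close();

    // Random read: a binary search over the sorted keys locates the record.
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    reader.get(new Text("key2"), value);
    reader.close();
    System.out.println(value);
  }
}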
