How big was your benchmark? For micro-benchmarks, CPU time will
dominate. For random access to collections larger than memory, disk
seeks should dominate. If you're interested in the latter case, then
you should benchmark this: build a database substantially larger than
the memory on your machine, and access it randomly for a while.
Doug
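The benchmark Doug describes can be sketched in plain Java. This is a minimal illustration, not a rigorous harness: the class name, record size, and counts below are placeholders, and for a real test you would set the record count so the file is well beyond physical RAM, forcing reads out of the OS page cache so disk seeks dominate.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.util.Random;

// Sketch of a disk-seek benchmark: write a large file of fixed-width
// records, then time random reads. All sizes here are illustrative;
// scale numRecords until the file is substantially larger than RAM.
public class SeekBenchmark {
    static final int RECORD_SIZE = 1024;

    // Fill the file with numRecords fixed-width records.
    static void build(File f, int numRecords) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        byte[] record = new byte[RECORD_SIZE];
        for (int i = 0; i < numRecords; i++) {
            raf.write(record);
        }
        raf.close();
    }

    // Read `reads` records at random offsets; return elapsed milliseconds.
    static long timeRandomReads(File f, int numRecords, int reads) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        byte[] buf = new byte[RECORD_SIZE];
        Random rnd = new Random(42);
        long start = System.currentTimeMillis();
        for (int i = 0; i < reads; i++) {
            long offset = (long) rnd.nextInt(numRecords) * RECORD_SIZE;
            raf.seek(offset);
            raf.readFully(buf);
        }
        raf.close();
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("seekbench", ".dat");
        f.deleteOnExit();
        int numRecords = 1024;  // placeholder: scale up well past RAM for a real test
        build(f, numRecords);
        long ms = timeRandomReads(f, numRecords, 1000);
        System.out.println("1000 random reads took " + ms + " ms");
    }
}
```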
Andy Liu wrote:
I ran a quick benchmark between Hadoop MapFile and Lucene's stored fields.
Using String keys, Hadoop was faster than Lucene, since in Lucene this
requires a TermQuery before the document data can be accessed. However,
using Lucene's internal IDs, pulling up the data is orders of magnitude
faster than MapFile. Looking at the code, it makes sense why: MapFile uses
a binary search on sorted keys to locate the data offsets, while Lucene's
internal IDs simply point to an offset in an index file, which in turn
points to the data offset in the .fdt file. I'm assuming that, for random
record access, it just doesn't get any faster than this.
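The layout Andy describes can be sketched in plain Java (the class and method names below are illustrative, not Lucene's actual stored-fields code): a variable-length data file plus an index file of fixed-width longs, so looking up record i is two seeks with no key comparisons at all, versus the O(log n) probes of a MapFile-style binary search over sorted keys.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;

// Illustrative sketch of a stored-field-style layout (not Lucene's real
// code): a variable-length data file (.fdt-like) plus an index file
// (.fdx-like) holding one fixed-width long offset per record. Record i's
// offset lives at byte i * 8, so lookup cost is constant regardless of
// collection size.
public class OffsetIndexedStore {
    private final RandomAccessFile idx;
    private final RandomAccessFile dat;

    public OffsetIndexedStore(File idxFile, File datFile) throws Exception {
        idx = new RandomAccessFile(idxFile, "r");
        dat = new RandomAccessFile(datFile, "r");
    }

    // Write records sequentially, recording each record's start offset
    // in the index file as we go.
    public static void write(File idxFile, File datFile, String[] records) throws Exception {
        DataOutputStream idxOut = new DataOutputStream(new FileOutputStream(idxFile));
        DataOutputStream datOut = new DataOutputStream(new FileOutputStream(datFile));
        for (String r : records) {
            idxOut.writeLong(datOut.size());  // offset of this record in the data file
            datOut.writeUTF(r);
        }
        idxOut.close();
        datOut.close();
    }

    // Constant-time lookup by internal id: two seeks, no key comparisons.
    public String get(int id) throws Exception {
        idx.seek((long) id * 8);
        long offset = idx.readLong();
        dat.seek(offset);
        return dat.readUTF();
    }
}
```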
My application doesn't require any incremental updates, so I'm considering
using Lucene's FSDirectory/IndexOutput/IndexInput to write out serialized
records in a similar way to how Lucene handles stored fields. The only
drawback is that I'll have to look up the records using the internal IDs.
I'm looking at BDB as well, since there's no limitation on what type of
keys I can use to look up the records. Thanks for your help.
Andy
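If external keys are still needed on top of an ID-addressed store, one workaround (an assumption on my part, not something proposed in the thread) is to build an in-memory key-to-ID map once at load time, paying the key-lookup cost only at startup while every subsequent read uses the constant-time internal-ID path. The class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: trade memory for key flexibility by registering
// keys in write order (the internal id is just the insertion index),
// then translating each external key to its id before the constant-time
// id-based lookup. Assumes keys are unique.
public class KeyedIdLookup {
    private final Map<String, Integer> keyToId = new HashMap<String, Integer>();

    // Register a key at write time; returns the internal id assigned to it.
    public int register(String key) {
        int id = keyToId.size();
        keyToId.put(key, id);
        return id;
    }

    // Resolve an external key to its internal id, or -1 if absent.
    public int idFor(String key) {
        Integer id = keyToId.get(key);
        return id == null ? -1 : id;
    }
}
```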
On 4/12/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
Andy Liu wrote:
> I'm exploring the possibility of using the Hadoop records framework to
> store these document records on disk. Here are my questions:
>
> 1. Is this a good application of the Hadoop records framework, keeping in
> mind that my goals are speed and scalability? I'm assuming the answer is
> yes, especially considering Nutch uses the same approach.
For read-only access, performance should be decent. However, Hadoop's
file structures do not permit incremental updates. Rather, they are
primarily designed for batch operations, like MapReduce outputs. If you
need to incrementally update your data, then you might look at something
like BDB, a relational DB, or perhaps experiment with HBase. (HBase is
designed to be a much more scalable, incrementally updatable DB than
BDB or relational DBs, but its implementation is not yet complete.)
Doug