My benchmark was for 5M records, on a 64-bit Opteron with 16 GB of memory. The Lucene index was about 10 GB, and the Hadoop record store was a few GB smaller.
Performing random seeks, the results were:

  Hadoop records:                        16.84 ms per seek
  Lucene records w/ TermQuery:           37.17 ms per seek
  Lucene records by Lucene document ID:   0.11 ms per seek

All three benchmarks were performed under the same conditions. The two Lucene benchmarks were performed on separate days, so I don't think the buffer cache would have kept the index in memory, although I must admit that I'm quite ignorant of how Linux buffer caches really work.

Andy
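For context, the three access paths being compared might look roughly like the sketch below, written against Hadoop 0.12-era and Lucene 2.x-era APIs; the field name "id", the use of BytesWritable values, and the class name SeekPaths are illustrative assumptions, not details from the thread.

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SeekPaths {

      // 1. Hadoop MapFile: a binary search over the sorted key index,
      //    then a seek into the data file.
      static byte[] hadoopGet(MapFile.Reader records, String key) throws Exception {
        BytesWritable value = new BytesWritable();
        return records.get(new Text(key), value) == null ? null : value.get();
      }

      // 2. Lucene stored fields via TermQuery: a term lookup runs before
      //    the stored fields can be fetched.
      static Document luceneByKey(IndexSearcher searcher, String key) throws Exception {
        Hits hits = searcher.search(new TermQuery(new Term("id", key)));
        return hits.length() == 0 ? null : hits.doc(0);
      }

      // 3. Lucene stored fields by internal document ID: a direct offset
      //    lookup through the .fdx/.fdt files.
      static Document luceneById(IndexReader reader, int docId) throws Exception {
        return reader.document(docId);
      }
    }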
On 4/13/07, Doug Cutting <[EMAIL PROTECTED]> wrote:

How big was your benchmark? For micro-benchmarks, CPU time will dominate.
For random access to collections larger than memory, disk seeks should
dominate. If you're interested in the latter case, then you should
benchmark this: build a database substantially larger than the memory on
your machine, and access it randomly for a while.

Doug

Andy Liu wrote:
> I ran a quick benchmark between Hadoop MapFile and Lucene's stored
> fields. Using String keys, Hadoop was faster than Lucene, since in
> Lucene this requires a TermQuery before the document data can be
> accessed. However, using Lucene's internal IDs, pulling up the data is
> orders of magnitude faster than MapFile. Looking at the code, it makes
> sense why: MapFile uses a binary search on sorted keys to locate the
> data offsets, while Lucene's internal IDs simply point to an offset in
> an index file that points to the data offset in the .fdt file. I'm
> assuming that, in terms of accessing random records, it just doesn't
> get any faster than this.
>
> My application doesn't require any incremental updates, so I'm
> considering using Lucene's FSDirectory/IndexOutput/IndexInput to write
> out serialized records in a similar way to how Lucene handles stored
> fields. The only drawback is that I'll have to look up the records
> using the internal IDs. I'm looking at BDB as well, since there's no
> limitation on what type of keys I can use to look up the records.
> Thanks for your help.
>
> Andy
>
> On 4/12/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
>>
>> Andy Liu wrote:
>> > I'm exploring the possibility of using the Hadoop records framework
>> > to store these document records on disk. Here are my questions:
>> >
>> > 1. Is this a good application of the Hadoop records framework,
>> > keeping in mind that my goals are speed and scalability? I'm
>> > assuming the answer is yes, especially considering Nutch uses the
>> > same approach
>>
>> For read-only access, performance should be decent. However, Hadoop's
>> file structures do not permit incremental updates. Rather, they are
>> primarily designed for batch operations, like MapReduce outputs. If
>> you need to incrementally update your data, then you might look at
>> something like BDB, a relational DB, or perhaps experiment with HBase.
>> (HBase is designed to be a much more scalable, incrementally updatable
>> DB than BDB or relational DBs, but its implementation is not yet
>> complete.)
>>
>> Doug
>>
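For the record store Andy floats above (serialized records written through Lucene's FSDirectory/IndexOutput/IndexInput and addressed by internal ID), a minimal sketch against the Lucene 2.x-era Directory API might look like the following; the file names "records.idx" and "records.dat", the fixed 8-byte offset table, and the class name RecordStore are illustrative assumptions, not something from the thread.

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    public class RecordStore {

      // Write records sequentially: an index file of fixed-width offsets
      // plus a data file of length-prefixed byte blobs.
      public static void write(Directory dir, byte[][] records) throws Exception {
        IndexOutput idx = dir.createOutput("records.idx");
        IndexOutput dat = dir.createOutput("records.dat");
        for (byte[] rec : records) {
          idx.writeLong(dat.getFilePointer()); // offset of this record in the data file
          dat.writeVInt(rec.length);
          dat.writeBytes(rec, rec.length);
        }
        idx.close();
        dat.close();
      }

      // Random access by internal ID: one seek into the offset table,
      // then one seek into the data file.
      public static byte[] read(Directory dir, int id) throws Exception {
        IndexInput idx = dir.openInput("records.idx");
        IndexInput dat = dir.openInput("records.dat");
        try {
          idx.seek(8L * id);          // fixed-width long offsets
          dat.seek(idx.readLong());
          byte[] rec = new byte[dat.readVInt()];
          dat.readBytes(rec, 0, rec.length);
          return rec;
        } finally {
          idx.close();
          dat.close();
        }
      }
    }

The layout mirrors the point made in the thread: fetching record i costs a direct offset lookup plus one data-file seek, with no key comparison or binary search, which is essentially how Lucene's .fdx/.fdt stored-field files behave.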
