On Fri, Apr 13, 2007 at 01:11:18PM -0400, Andy Liu wrote:
>
>All 3 benchmarks were performed under the same conditions.  The 2 Lucene
>benchmarks were performed on separate days, so I don't think the buffer
>cache would've kept the index in memory, although I must admit that I'm
>quite ignorant of how the Linux buffer cache really works.
>

In my previous life, the uninformed way of achieving something like this was
to mmap a file whose size was greater than available RAM, write zeros to the
entire file, sync it, and exit.  YMMV.
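
If it helps, here is a rough Java sketch of that trick.  The file name, the
size argument and the 1 GB chunking are arbitrary choices of mine, and each
mapping is kept under the 2 GB MappedByteBuffer limit; treat it as an
illustration, not a polished tool.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Map a file larger than physical RAM in chunks, dirty every page with zeros,
// sync, and exit, so the kernel has to evict whatever it had cached before.
public class EvictBufferCache {
  public static void main(String[] args) throws Exception {
    String path = args[0];                          // scratch file on the disk under test
    long size = Long.parseLong(args[1]) << 30;      // size in GB; pick more than RAM
    RandomAccessFile raf = new RandomAccessFile(path, "rw");
    raf.setLength(size);
    FileChannel ch = raf.getChannel();
    byte[] zeros = new byte[1 << 20];               // 1 MB block of zeros
    long chunk = 1L << 30;                          // map 1 GB at a time
    for (long pos = 0; pos < size; pos += chunk) {
      int len = (int) Math.min(chunk, size - pos);
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, pos, len);
      while (buf.hasRemaining()) {
        buf.put(zeros, 0, Math.min(zeros.length, buf.remaining()));
      }
      buf.force();                                  // sync this chunk to disk
    }
    ch.close();
    raf.close();
  }
}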

However, this post piqued my curiosity, and apparently if you have a kernel
newer than 2.6.16.* you could try this:
http://aplawrence.com/Linux/buffer_cache.html
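
If I remember right, that page boils down to the drop_caches knob added in
2.6.16: as root, run "sync" and then write 3 to /proc/sys/vm/drop_caches to
drop the page cache plus dentries and inodes.  Normally you'd just do that
from the shell; purely to stay in Java like the rest of this thread, an
equivalent sketch:

import java.io.FileWriter;

public class DropCaches {
  public static void main(String[] args) throws Exception {
    Runtime.getRuntime().exec("sync").waitFor();     // flush dirty pages first
    FileWriter w = new FileWriter("/proc/sys/vm/drop_caches");
    w.write("3\n");                                  // 3 = page cache + dentries/inodes; needs root
    w.close();
  }
}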

hth,
Arun

>Andy
>On 4/13/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
>>How big was your benchmark?  For micro-benchmarks, CPU time will
>>dominate.  For random access to collections larger than memory, disk
>>seeks should dominate.  If you're interested in the latter case, then
>>you should benchmark this: build a database substantially larger than
>>the memory on your machine, and access it randomly for a while.
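
For what it's worth, a library-agnostic way to get into that seek-bound
regime is simply to time random positional reads against a file substantially
larger than RAM.  A sketch, with the record size and read count picked
arbitrarily:

import java.io.RandomAccessFile;
import java.util.Random;

// Random fixed-size reads over a file much larger than RAM: most reads miss
// the buffer cache, so the numbers are dominated by disk seeks, not CPU.
public class RandomReadBench {
  public static void main(String[] args) throws Exception {
    RandomAccessFile f = new RandomAccessFile(args[0], "r");   // the big data file
    long length = f.length();
    byte[] record = new byte[1024];                            // arbitrary record size
    Random rnd = new Random();
    int reads = 100000;
    long start = System.currentTimeMillis();
    for (int i = 0; i < reads; i++) {
      long pos = (long) (rnd.nextDouble() * (length - record.length));
      f.seek(pos);
      f.readFully(record);
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println(reads + " random reads in " + elapsed + " ms");
    f.close();
  }
}
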
>>
>>Doug
>>
>>Andy Liu wrote:
>>> I ran a quick benchmark between Hadoop MapFile and Lucene's stored fields.
>>> Using String keys, Hadoop was faster than Lucene, since in Lucene this
>>> requires a TermQuery before the document data can be accessed.  However,
>>> using Lucene's internal ID's, pulling up the data is orders of magnitude
>>> faster than MapFile.  Looking at the code, it makes sense why: MapFile uses
>>> a binary search on sorted keys to locate the data offsets, while Lucene's
>>> internal ID's simply point to an offset in an index file that points to the
>>> data offset in the .fdt file.  I'm assuming in terms of accessing random
>>> records, it just doesn't get any faster than this.
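
Not Andy's actual benchmark code, but the two access paths being compared
look roughly like this; a sketch against the Lucene 2.x / Hadoop 0.x era
APIs, with made-up paths and field names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class LookupSketch {

  // MapFile: binary search over the sorted key index, then a seek into the data file.
  static Text mapFileLookup(String dir, String key) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    reader.get(new Text(key), value);   // fills value; returns null if the key is absent
    reader.close();
    return value;
  }

  // Lucene by external key: a TermQuery first, then fetch the stored fields.
  static Document luceneByKey(IndexSearcher searcher, String key) throws Exception {
    Hits hits = searcher.search(new TermQuery(new Term("id", key)));
    return hits.length() > 0 ? hits.doc(0) : null;
  }

  // Lucene by internal doc ID: straight to the stored-field offsets, no search at all.
  static Document luceneByDocId(IndexReader reader, int docId) throws Exception {
    return reader.document(docId);
  }
}
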
>>>
>>> My application doesn't require any incremental updates, so I'm considering
>>> using Lucene's FSDirectory/IndexOutput/IndexInput to write out serialized
>>> records in a similar way to how Lucene handles stored fields.  The only
>>> drawback is that I'll have to look up the records using the internal ID's.
>>> I'm looking at BDB as well, since there's no limitation to what type of
>>> keys I can use to look up the records.  Thanks for your help.
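
A minimal sketch of the layout described above, assuming the Lucene 2.x
Directory API and invented file names: fixed-width offsets in one file so an
internal ID maps straight to a seek position, variable-length records in a
second file.

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

public class RecordStore {

  // Write the records sequentially, recording each one's start offset as a
  // fixed 8-byte entry in the index file.
  static void write(Directory dir, String[] records) throws Exception {
    IndexOutput idx = dir.createOutput("records.idx");
    IndexOutput dat = dir.createOutput("records.dat");
    for (String record : records) {
      idx.writeLong(dat.getFilePointer());   // offset of this record in records.dat
      dat.writeString(record);               // the serialized record itself
    }
    idx.close();
    dat.close();
  }

  // Look up by internal ID: one seek into the fixed-width index, then one
  // seek into the data file.
  static String read(Directory dir, int id) throws Exception {
    IndexInput idx = dir.openInput("records.idx");
    IndexInput dat = dir.openInput("records.dat");
    idx.seek(8L * id);
    dat.seek(idx.readLong());
    String record = dat.readString();
    idx.close();
    dat.close();
    return record;
  }

  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.getDirectory("/tmp/recordstore", true);  // true = create/overwrite (2.x-era call)
    write(dir, new String[] { "first record", "second record" });
    System.out.println(read(dir, 1));        // prints "second record"
    dir.close();
  }
}
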
>>>
>>> Andy
>>>
>>> On 4/12/07, Doug Cutting <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Andy Liu wrote:
>>>> > I'm exploring the possibility of using the Hadoop records framework to
>>>> > store these document records on disk.  Here are my questions:
>>>> >
>>>> > 1. Is this a good application of the Hadoop records framework, keeping
>>>> > in mind that my goals are speed and scalability?  I'm assuming the
>>>> > answer is yes, especially considering Nutch uses the same approach
>>>>
>>>> For read-only access, performance should be decent.  However Hadoop's
>>>> file structures do not permit incremental updates.  Rather they are
>>>> primarily designed for batch operations, like MapReduce outputs.  If you
>>>> need to incrementally update your data, then you might look at something
>>>> like BDB, a relational DB, or perhaps experiment with HBase.  (HBase is
>>>> designed to be a much more scalable, incrementally updateable DB than
>>>> BDB or relational DBs, but its implementation is not yet complete.)
>>>>
>>>> Doug
>>>>
>>>
>>
