Re: Using Hadoop for Record storage

Otis Gospodnetic Thu, 12 Apr 2007 09:47:07 -0700

I'm curious what others will say about Hadoop.  I'll just recommend BDB
(Berkeley DB): I've had good experience combining Lucene indices in which
only the id field is stored with BDBs that store and retrieve the data for
the set of ids in a given search result.
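
To make the pattern concrete, here is a minimal sketch in Python. It is not
Lucene or BDB code: a toy in-memory inverted index stands in for the Lucene
index (it keeps only ids), and the standard-library dbm.dumb module stands in
for BDB as the key-value record store. All names (IdOnlyIndex, store_record,
fetch_records, demo) and the sample documents are made up for illustration.

```python
import dbm.dumb
import json
import os
import tempfile
from collections import defaultdict


class IdOnlyIndex:
    """Toy inverted index holding only document ids (stand-in for Lucene)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Index the text, but keep nothing except the id in the postings.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # Return matching ids in a stable order.
        return sorted(self.postings.get(term.lower(), set()))


def store_record(db, doc_id, record):
    # The full record lives in the key-value store, keyed by document id.
    db[doc_id.encode("utf-8")] = json.dumps(record).encode("utf-8")


def fetch_records(db, doc_ids):
    # One lookup per id returned by the search.
    return [json.loads(db[doc_id.encode("utf-8")]) for doc_id in doc_ids]


def demo():
    # Hypothetical sample data.
    docs = {
        "doc1": {"title": "Hadoop intro", "text": "hadoop records framework"},
        "doc2": {"title": "BDB notes", "text": "berkeley db key value store"},
        "doc3": {"title": "Nutch", "text": "nutch uses hadoop"},
    }
    index = IdOnlyIndex()
    with tempfile.TemporaryDirectory() as tmp:
        db = dbm.dumb.open(os.path.join(tmp, "records"), "c")
        try:
            for doc_id, record in docs.items():
                index.add(doc_id, record["text"])
                store_record(db, doc_id, record)
            hits = index.search("hadoop")    # ids only
            return fetch_records(db, hits)   # data comes from the store
        finally:
            db.close()


if __name__ == "__main__":
    for record in demo():
        print(record["title"])
```

The point of the split is that the index stays small (ids only), while the
record store can be sized, placed, and tuned independently of it.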

Otis 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Andy Liu <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, April 10, 2007 5:41:36 PM
Subject: Using Hadoop for Record storage

Currently I'm working on a search application that uses Lucene.  Many of the
fields I index in Lucene are stored fields, because I need to retrieve the
actual text and metadata of each document, and subsequently present the data
to the user.

We're starting to work with tens of millions of documents, so scalability of
our application is a concern that we're currently addressing.  One specific
point we're looking at is whether or not it makes sense to use Lucene as
strictly an inverted index, and store the document text and metadata in a
different type of datastore.  From my understanding, the advantages of doing
this are:

1. Indexing will be faster, since stored fields would otherwise need to be
written and re-written during Lucene segment merging.
2. The separation affords more flexibility: if, say, I want multiple indexes
and a distributed search, the record data can be distributed
differently/separately from the Lucene index.
3. It may be possible to select a datastore technology that is faster than
Lucene at retrieving document data, especially in the 30M-50M document
collection range.

I'm exploring the possibility of using the Hadoop records framework to store
these document records on disk.  Here are my questions:

1. Is this a good application of the Hadoop records framework, keeping in
mind that my goals are speed and scalability?  I'm assuming the answer is
yes, especially considering that Nutch uses the same approach.

2. Is Hadoop records the fastest and most scalable technology to tackle this
problem?  Are there other record storage technologies out there that you can
recommend?  I'm assuming traditional RDBMSs would not scale as gracefully,
although if anybody has successfully tackled this problem using a
traditional database, let me know.

Andy
