I'm curious what others will say about Hadoop. I'll just recommend BDB: I have had good experience combining Lucene indices in which only the id field is stored with BDBs that store and retrieve the data for the set of ids in a given search result.
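To make that split concrete, here is a minimal Python sketch of the pattern, under loud assumptions: `IdOnlyIndex` is a toy in-memory inverted index standing in for a Lucene index that stores only the id field, and the standard-library `dbm.dumb` module stands in for BDB as the on-disk key-value store. All class and function names are hypothetical, and none of this reflects the actual Lucene or BDB APIs.

```python
import dbm.dumb
import json
import os
import tempfile
from collections import defaultdict


class IdOnlyIndex:
    """Toy inverted index: terms -> document ids (stand-in for Lucene)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        # Only the id is kept per term; the text itself is not stored here.
        for term in text.lower().split():
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)


def store_record(db, doc_id, record):
    """Write the full record to the key-value store, keyed by document id."""
    db[doc_id] = json.dumps(record)


def fetch_records(db, doc_ids):
    """Look up the full records for the ids returned by a search."""
    return [json.loads(db[doc_id]) for doc_id in doc_ids]


def demo():
    docs = {
        "doc1": {"title": "Hadoop records", "body": "hadoop record storage"},
        "doc2": {"title": "Lucene merging", "body": "lucene segment merging"},
        "doc3": {"title": "Lucene and Hadoop",
                 "body": "lucene index with hadoop storage"},
    }
    index = IdOnlyIndex()
    with tempfile.TemporaryDirectory() as tmp:
        # dbm.dumb is a simple on-disk store, used here in place of BDB.
        db = dbm.dumb.open(os.path.join(tmp, "records"), "c")
        try:
            for doc_id, record in docs.items():
                index.add(doc_id, record["body"])
                store_record(db, doc_id, record)
            ids = index.search("hadoop storage")   # step 1: search yields ids
            return ids, fetch_records(db, ids)     # step 2: ids yield records
        finally:
            db.close()


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the index and the record store only share document ids, so each side can be sized, distributed, or replaced independently of the other.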
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: Andy Liu <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Tuesday, April 10, 2007 5:41:36 PM
Subject: Using Hadoop for Record storage

Currently I'm working on a search application that uses Lucene. Many of the fields I index in Lucene are stored fields, because I need to retrieve the actual text and metadata of each document, and subsequently present the data to the user. We're starting to work with tens of millions of documents, so scalability of our application is a concern that we're currently addressing.

One specific point we're looking at is whether or not it makes sense to use Lucene as strictly an inverted index, and store the document text and metadata in a different type of datastore. From my understanding, the advantages of doing this are:

1. Indexing will be faster, since stored fields need to be written and re-written during Lucene segment merging.
2. The separation affords more flexibility, if say I want to do multiple indexes and a distributed search, the records data can be distributed differently/separately from the Lucene index.
3. Maybe it is possible to select a datastore technology that would be faster than Lucene at retrieving document data, especially in the 30M-50M document collection range.

I'm exploring the possibility of using the Hadoop records framework to store these document records on disk. Here are my questions:

1. Is this a good application of the Hadoop records framework, keeping in mind that my goals are speed and scalability? I'm assuming the answer is yes, especially considering Nutch uses the same approach.
2. Is Hadoop records the fastest and most scalable technology to tackle this problem? Are there other record storage technologies out there that you can recommend?

I'm assuming traditional RDBMS's would not scale as gracefully, although if anybody has successfully tackled this problem using a traditional database let me know.

Andy
