I should have been more specific. Create the indexes using MapReduce, then store them on the DFS using the indexer job. To have a cluster of servers answer a single query, we have found it a best practice to split the index and its associated databases into smaller pieces, keep those pieces on the local file systems of the search nodes, and front them with distributed search servers. A search website then uses the search servers to answer the query. An example of this setup can be found in the NutchHadoopTutorial on the Nutch wiki.
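
For illustration only (this is not Nutch's actual indexer job), a minimal sketch in current Lucene of the "split the index into smaller pieces" idea: route each document to one of N shards by hashing its key, with every shard being an ordinary Lucene index on a search server's local disk. All class and path names here are made up.

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedIndexer implements AutoCloseable {
      private final IndexWriter[] shards;

      public ShardedIndexer(String baseDir, int numShards) throws IOException {
        shards = new IndexWriter[numShards];
        for (int i = 0; i < numShards; i++) {
          // each piece lives on a local file system, e.g. /data/index/shard-0
          FSDirectory dir = FSDirectory.open(Paths.get(baseDir, "shard-" + i));
          shards[i] = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        }
      }

      public void add(String url, String content) throws IOException {
        // route by hash of the key so every document lands in exactly one piece
        int shard = (url.hashCode() & Integer.MAX_VALUE) % shards.length;
        Document doc = new Document();
        doc.add(new StringField("url", url, Store.YES));
        doc.add(new TextField("content", content, Store.NO));
        shards[shard].addDocument(doc);
      }

      @Override
      public void close() throws IOException {
        for (IndexWriter w : shards) {
          w.close();
        }
      }
    }

Each shard can then be served by its own search server, and the search website merges the per-shard results.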

Dennis

Doug Cutting wrote:
Dennis Kubes wrote:
You would build the indexes on Hadoop but then move them to local file systems for searching. You wouldn't want to perform searches using the DFS.

Creating Lucene indexes directly in DFS would be pretty slow. Nutch creates them locally, then copies them to DFS to avoid this.
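
A minimal sketch of that two-step pattern, using current Lucene and Hadoop APIs rather than the actual Nutch code: build the index in a local directory with an IndexWriter, then copy the finished files into DFS with FileSystem.copyFromLocalFile. Paths are illustrative.

    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class BuildLocallyThenCopy {
      public static void main(String[] args) throws Exception {
        String local = "/tmp/index-build";          // fast local disk
        String dfs = "/user/nutch/indexes/part-0";  // final DFS location

        // 1. build the index on the local file system
        try (IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get(local)),
            new IndexWriterConfig(new StandardAnalyzer()))) {
          // ... addDocument() calls go here ...
          writer.forceMerge(1);  // optional: a single segment copies and serves nicely
        }

        // 2. copy the finished index into DFS in one pass
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(false /* keep local */, true /* overwrite */,
            new Path(local), new Path(dfs));
      }
    }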

One could create a Lucene Directory implementation optimized for updates, where new files are written locally, and only flushed to DFS when the Directory is closed. When updating, Lucene creates and reads lots of files that might not last very long, so there's little point in replicating them on the network. For many applications, that should be considerably faster than either updating indexes directly in HDFS, or copying the entire index locally, modifying it, then copying it back.
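
A rough sketch of that idea, assuming modern Lucene and Hadoop APIs: wrap a local FSDirectory in a FilterDirectory whose close() pushes the surviving files to DFS. A real implementation would also need to read files that already live in DFS and handle failures mid-copy; this only shows the write-locally, flush-on-close part.

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.FilterDirectory;

    /** Writes all new index files to local disk; copies them to DFS only on close(). */
    public class LocalBufferedDfsDirectory extends FilterDirectory {
      private final String localPath;
      private final FileSystem dfs;
      private final Path dfsTarget;

      public LocalBufferedDfsDirectory(String localPath, FileSystem dfs, Path dfsTarget)
          throws IOException {
        super(FSDirectory.open(Paths.get(localPath)));  // all writes and reads stay local
        this.localPath = localPath;
        this.dfs = dfs;
        this.dfsTarget = dfsTarget;
      }

      @Override
      public void close() throws IOException {
        super.close();  // make sure the local index is fully flushed first
        // Short-lived merge files have already been deleted locally by Lucene,
        // so only the surviving segment files are replicated over the network.
        dfs.copyFromLocalFile(false, true, new Path(localPath), dfsTarget);
      }
    }

An IndexWriter opened over such a directory would behave as described above: temporary files never leave the local disk, and the network only sees the final result.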

Lucene search works from HDFS-resident indexes, but it is slow, especially if the indexes were created on a different node than the one searching them. (HDFS tries to write one replica of each block locally on the node where it is created.)

Doug
