I should have been more specific. Create the indexes using MapReduce, then store them on the DFS using the indexer job. To have a cluster of servers answer a single query, we have found it a best practice to split the index and its associated databases into smaller pieces, keep those pieces on the local file systems of the search nodes, and front them with distributed search servers. A search website then uses the search servers to answer the query. An example of this setup can be found in the NutchHadoopTutorial on the Nutch wiki.
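
For illustration only (this is not Nutch's actual indexer job), a minimal sketch in current Lucene of the "split the index into smaller pieces" idea: route each document to one of N shards by hashing its key, with every shard being an ordinary Lucene index on a search server's local disk. All class and path names here are made up.

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class ShardedIndexer implements AutoCloseable {
      private final IndexWriter[] shards;

      public ShardedIndexer(String baseDir, int numShards) throws IOException {
        shards = new IndexWriter[numShards];
        for (int i = 0; i < numShards; i++) {
          // each piece lives on a local file system, e.g. /data/index/shard-0
          FSDirectory dir = FSDirectory.open(Paths.get(baseDir, "shard-" + i));
          shards[i] = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        }
      }

      public void add(String url, String content) throws IOException {
        // route by hash of the key so every document lands in exactly one piece
        int shard = (url.hashCode() & Integer.MAX_VALUE) % shards.length;
        Document doc = new Document();
        doc.add(new StringField("url", url, Store.YES));
        doc.add(new TextField("content", content, Store.NO));
        shards[shard].addDocument(doc);
      }

      @Override
      public void close() throws IOException {
        for (IndexWriter w : shards) {
          w.close();
        }
      }
    }

Each shard can then be served by its own search server, and the search website merges the per-shard results.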

Dennis

Doug Cutting wrote:
Dennis Kubes wrote:
You would build the indexes on Hadoop but then move them to local file systems for searching. You wouldn't want to perform searches using the DFS.

Creating Lucene indexes directly in DFS would be pretty slow. Nutch creates them locally, then copies them to DFS to avoid this.
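
A minimal sketch of that two-step pattern, using current Lucene and Hadoop APIs rather than the actual Nutch code: build the index in a local directory with an IndexWriter, then copy the finished files into DFS with FileSystem.copyFromLocalFile. Paths are illustrative.

    import java.nio.file.Paths;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class BuildLocallyThenCopy {
      public static void main(String[] args) throws Exception {
        String local = "/tmp/index-build";          // fast local disk
        String dfs = "/user/nutch/indexes/part-0";  // final DFS location

        // 1. build the index on the local file system
        try (IndexWriter writer = new IndexWriter(
            FSDirectory.open(Paths.get(local)),
            new IndexWriterConfig(new StandardAnalyzer()))) {
          // ... addDocument() calls go here ...
          writer.forceMerge(1);  // optional: a single segment copies and serves nicely
        }

        // 2. copy the finished index into DFS in one pass
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(false /* keep local */, true /* overwrite */,
            new Path(local), new Path(dfs));
      }
    }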

One could create a Lucene Directory implementation optimized for updates, where new files are written locally, and only flushed to DFS when the Directory is closed. When updating, Lucene creates and reads lots of files that might not last very long, so there's little point in replicating them on the network. For many applications, that should be considerably faster than either updating indexes directly in HDFS, or copying the entire index locally, modifying it, then copying it back.
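
A rough sketch of that idea, assuming modern Lucene and Hadoop APIs: wrap a local FSDirectory in a FilterDirectory whose close() pushes the surviving files to DFS. A real implementation would also need to read files that already live in DFS and handle failures mid-copy; this only shows the write-locally, flush-on-close part.

    import java.io.IOException;
    import java.nio.file.Paths;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.FilterDirectory;

    /** Writes all new index files to local disk; copies them to DFS only on close(). */
    public class LocalBufferedDfsDirectory extends FilterDirectory {
      private final String localPath;
      private final FileSystem dfs;
      private final Path dfsTarget;

      public LocalBufferedDfsDirectory(String localPath, FileSystem dfs, Path dfsTarget)
          throws IOException {
        super(FSDirectory.open(Paths.get(localPath)));  // all writes and reads stay local
        this.localPath = localPath;
        this.dfs = dfs;
        this.dfsTarget = dfsTarget;
      }

      @Override
      public void close() throws IOException {
        super.close();  // make sure the local index is fully flushed first
        // Short-lived merge files have already been deleted locally by Lucene,
        // so only the surviving segment files are replicated over the network.
        dfs.copyFromLocalFile(false, true, new Path(localPath), dfsTarget);
      }
    }

An IndexWriter opened over such a directory would behave as described above: temporary files never leave the local disk, and the network only sees the final result.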

Lucene search works from HDFS-resident indexes, but it is slow, especially if the indexes were created on a different node than the one searching them. (HDFS tries to write one replica of each block locally on the node where it is created.)

Doug
