I should have been more specific. Create the indexes using mapreduce,
then store them on the DFS using the indexer job. To have a cluster of
servers answer a single query, we have found it best to split the index
and its associated databases into smaller pieces, keep those pieces on
the local file systems of the search nodes, and front them with
distributed search servers. A search website then queries those search
servers to answer each request. An example of this setup can be found
in the NutchHadoopTutorial on the Nutch wiki.
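
In case it helps, here is a rough Java sketch of the copy step a search
node would run to pull one index shard out of the DFS onto its local
disk before a distributed search server is pointed at it. The class
name, shard name, and paths are made up for illustration; only the
Hadoop FileSystem calls are real API.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class PullShardLocal {
    public static void main(String[] args) throws Exception {
      FileSystem dfs = FileSystem.get(new Configuration());

      // Hypothetical locations: one shard of the split index on the DFS,
      // and the local directory the search server on this node will use.
      Path dfsShard = new Path("/indexes/shard-0003");
      Path localDir = new Path("/data/search/shard-0003");

      // Copy the shard down to the local file system; searching then
      // never touches the DFS.
      dfs.copyToLocalFile(dfsShard, localDir);
    }
  }

Consult the tutorial for the exact commands to start the search servers
and point the search webapp at them.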
Dennis
Doug Cutting wrote:
Dennis Kubes wrote:
You would build the indexes on Hadoop but then move them to local
file systems for searching. You wouldn't want to perform searches
using the DFS.
Creating Lucene indexes directly in DFS would be pretty slow. Nutch
creates them locally, then copies them to DFS to avoid this.
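
To make that concrete, the copy-back step amounts to a single Hadoop
FileSystem call once the local index directory is finished. The class
and paths below are hypothetical; only the Hadoop API calls are real.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CopyIndexToDfs {
    public static void main(String[] args) throws Exception {
      FileSystem dfs = FileSystem.get(new Configuration());
      // The index was built by Lucene on local disk; push the finished
      // directory up to the DFS in one pass (paths are made up).
      dfs.copyFromLocalFile(new Path("/tmp/nutch-index-part-0000"),
                            new Path("/user/nutch/indexes/part-0000"));
    }
  }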
One could create a Lucene Directory implementation optimized for
updates, where new files are written locally, and only flushed to DFS
when the Directory is closed. When updating, Lucene creates and reads
lots of files that might not last very long, so there's little point
in replicating them on the network. For many applications, that
should be considerably faster than either updating indexes directly in
HDFS, or copying the entire index locally, modifying it, then copying
it back.
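
A minimal sketch of just the close-time flush, assuming such a
Directory delegates its day-to-day writes to a local FSDirectory; this
is not a full Directory implementation, and the class name and paths
are hypothetical. Because only the files that survive until close()
get copied, short-lived merge files never cross the network.

  import java.io.File;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  /** Hypothetical helper: a write-locally Directory would call this
   *  from its close() to push the surviving files up to the DFS. */
  public class DfsFlushOnClose {
    public static void flush(File localIndexDir, Path dfsIndexDir)
        throws Exception {
      FileSystem dfs = FileSystem.get(new Configuration());
      dfs.mkdirs(dfsIndexDir);
      // Temporary segment files created and deleted during merging are
      // already gone by now, so they were never replicated.
      for (File f : localIndexDir.listFiles()) {
        dfs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                              new Path(dfsIndexDir, f.getName()));
      }
    }
  }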
Lucene search works from HDFS-resident indexes, but is slow,
especially if the indexes were created on a different node than the one
searching them. (HDFS tries to write one replica of each block
locally on the node where it is created.)
Doug