Dennis Kubes wrote:
You would build the indexes on Hadoop but then move them to local file systems for searching. You wouldn't want to perform searches using the DFS.
Creating Lucene indexes directly in DFS would be pretty slow. Nutch creates them locally, then copies them to DFS to avoid this.
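As a rough illustration of that pattern, a minimal sketch using the Hadoop FileSystem API; the index is assumed to have already been built in a local directory, and both paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// The index was built locally (e.g. with an ordinary FSDirectory);
// push the finished files into DFS in a single copy.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path localIndex = new Path("/tmp/index-part-00000");          // hypothetical local path
Path dfsIndex   = new Path("/user/nutch/indexes/part-00000"); // hypothetical DFS path
fs.copyFromLocalFile(localIndex, dfsIndex);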
One could create a Lucene Directory implementation optimized for updates, where new files are written locally, and only flushed to DFS when the Directory is closed. When updating, Lucene creates and reads lots of files that might not last very long, so there's little point in replicating them on the network. For many applications, that should be considerably faster than either updating indexes directly in HDFS, or copying the entire index locally, modifying it, then copying it back.
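To make the lifecycle concrete, here is a minimal sketch of the "write locally, flush to DFS on close" idea. It is deliberately not a real Lucene Directory subclass (that would depend on the Lucene version's abstract methods); it only shows the shape: Lucene does all its short-lived reads and writes in a local working directory, and whatever files survive are copied to HDFS once, at close. Class and method names are hypothetical.

import java.io.Closeable;
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch: buffer index updates on local disk, flush to DFS on close. */
public class LocalThenDfsFlush implements Closeable {
  private final File localWorkDir;  // where Lucene actually reads and writes
  private final Path dfsTarget;     // final location of the index in DFS
  private final FileSystem fs;

  public LocalThenDfsFlush(File localWorkDir, Path dfsTarget, Configuration conf)
      throws IOException {
    this.localWorkDir = localWorkDir;
    this.dfsTarget = dfsTarget;
    this.fs = FileSystem.get(conf);
  }

  /** Local directory to hand to Lucene (e.g. wrapped in a local FSDirectory). */
  public File workDir() {
    return localWorkDir;
  }

  /**
   * Copy every surviving file to DFS. Transient files that Lucene created
   * and deleted during merging never touch the network.
   */
  @Override
  public void close() throws IOException {
    File[] files = localWorkDir.listFiles();
    if (files == null) {
      return; // nothing was written
    }
    for (File f : files) {
      fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                           new Path(dfsTarget, f.getName()));
    }
  }
}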
Lucene search works from HDFS-resident indexes, but it is slow, especially if the indexes were created on a different node than the one searching them. (HDFS tries to write one replica of each block locally on the node where it is created.)
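The search-side counterpart of the copy above, again with hypothetical paths: pull the index out of DFS onto the search node's local disk and open that copy with a normal Lucene FSDirectory/IndexSearcher, rather than searching it in place over HDFS.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path dfsIndex   = new Path("/user/nutch/indexes/part-00000");  // hypothetical DFS path
Path localIndex = new Path("/local/search/index/part-00000");  // hypothetical local path
fs.copyToLocalFile(dfsIndex, localIndex);
// then open the local copy with Lucene as usual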
Doug
