eks dev wrote:
indeed, Ning did a very nice job. This could also be used as an
alternative to the rsync scripts (Solr, Nutch) utilizing Hadoop.

Again, our "shards" are more complicated than Lucene indexes, and don't yield so easily to incremental updates, so this mechanism is of limited use to Nutch.

I think that what we ultimately need is a form of "shard manager" component that runs on the cluster and communicates with its sub-components running on search servers.

The shard manager would be responsible for preparing a de-duplicated set of active shards (within the "active shard" time window), and then deploying them on search servers in a way that minimizes traffic, minimizes downtime and ensures redundancy. The deployment sub-components on shard servers would have to take care of removing obsolete shards (or obsolete documents from still-active shards) and adding new shards.
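
To make the idea above a bit more concrete, here's a minimal sketch of the placement step such a shard manager might perform: given the set of active shards, the list of search servers, and the current placement, compute a new placement with the desired replication factor, reusing existing copies first so that as little data as possible has to be moved. All names here are hypothetical, not existing Nutch APIs.

```java
import java.util.*;

// Hypothetical sketch of a shard-manager placement step. Reuses replicas
// already present on servers (minimizing traffic) and spreads new copies
// over the least-loaded servers (for redundancy). Illustrative only.
public class ShardPlacement {

    /** Returns shard -> set of servers, preferring servers that already hold the shard. */
    public static Map<String, Set<String>> place(Set<String> activeShards,
                                                 List<String> servers,
                                                 Map<String, Set<String>> current,
                                                 int replication) {
        Map<String, Set<String>> next = new HashMap<>();
        // Per-server shard count, so new copies spread evenly.
        Map<String, Integer> load = new HashMap<>();
        for (String s : servers) load.put(s, 0);

        for (String shard : activeShards) {
            Set<String> holders = new LinkedHashSet<>();
            // 1. Reuse existing replicas first: no copying needed.
            for (String s : current.getOrDefault(shard, Collections.emptySet())) {
                if (holders.size() < replication && load.containsKey(s)) {
                    holders.add(s);
                }
            }
            // 2. Fill the remaining replicas on the least-loaded servers.
            while (holders.size() < replication) {
                String best = null;
                for (String s : servers) {
                    if (holders.contains(s)) continue;
                    if (best == null || load.get(s) < load.get(best)) best = s;
                }
                if (best == null) break; // fewer servers than requested replicas
                holders.add(best);
            }
            for (String s : holders) load.merge(s, 1, Integer::sum);
            next.put(shard, holders);
        }
        return next;
    }
}
```

Obsolete shards would then be whatever a server holds that no longer appears in the new placement; the deployment sub-components could diff the two maps and delete locally.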

The query-integrator component (a.k.a. NutchBean, or the search front-end) would have to be modified. First, we need to review and apply the patch in NUTCH-92 to fix the scoring issues. Second, we would need to make the distributed searching more robust, so that the front-end knows about multiple replicas of each active shard and can route queries to the respective shard servers based on their current load and availability.
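
The routing part of that could be as simple as the following sketch: the front-end keeps a table of live replicas and their reported loads (fed by the shard manager), and picks the least-loaded live replica per query. The class and method names are assumptions for illustration, not part of NutchBean.

```java
import java.util.*;

// Hypothetical replica-aware routing for the search front-end: pick the
// live replica with the lowest reported load; servers with no report are
// treated as down. Illustrative only, not an existing Nutch API.
public class ReplicaRouter {
    /** server -> current load; servers absent from the map are considered down. */
    private final Map<String, Integer> load = new HashMap<>();

    public void report(String server, int currentLoad) {
        load.put(server, currentLoad);
    }

    public void markDown(String server) {
        load.remove(server);
    }

    /** Returns the least-loaded live replica, or null if none is available. */
    public String route(List<String> replicas) {
        String best = null;
        for (String r : replicas) {
            Integer l = load.get(r);
            if (l == null) continue; // replica is down, skip it
            if (best == null || l < load.get(best)) best = r;
        }
        return best;
    }
}
```

A null result would mean the shard has no live replica, which the front-end could surface as a partial-results warning rather than a hard failure.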

This is still just a bunch of loosely-connected ideas I have about the future Nutch architecture ... but sooner or later _something_ has to be done to ease the pain that is the deployment of crawl artifacts, especially in a continuously running operation with incremental indexing ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
