eks dev wrote:
indeed, Ning did a very nice job. This could also be used as an
alternative to the rsync scripts (Solr, Nutch) utilizing Hadoop.

Again, our "shards" are more complicated than Lucene indexes, and don't yield so easily to incremental updates, so this mechanism is of limited use to Nutch.

I think that what we ultimately need is a form of "shard manager" component that runs on the cluster and communicates with its sub-components running on search servers.

The shard manager would be responsible for preparing a de-duplicated set of active shards (within the "active shard" time window), and then deploying them on search servers in a way that minimizes traffic, minimizes downtime and ensures redundancy. The deployment sub-components on shard servers would have to take care of removing obsolete shards (or obsolete documents from still-active shards) and adding new shards.
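
To make the idea above a bit more concrete, here's a minimal sketch of the placement step such a shard manager might perform: given the set of active shards, the list of search servers, and the current placement, compute a new placement with the desired replication factor, reusing existing copies first so that as little data as possible has to be moved. All names here are hypothetical, not existing Nutch APIs.

```java
import java.util.*;

// Hypothetical sketch of a shard-manager placement step. Reuses replicas
// already present on servers (minimizing traffic) and spreads new copies
// over the least-loaded servers (for redundancy). Illustrative only.
public class ShardPlacement {

    /** Returns shard -> set of servers, preferring servers that already hold the shard. */
    public static Map<String, Set<String>> place(Set<String> activeShards,
                                                 List<String> servers,
                                                 Map<String, Set<String>> current,
                                                 int replication) {
        Map<String, Set<String>> next = new HashMap<>();
        // Per-server shard count, so new copies spread evenly.
        Map<String, Integer> load = new HashMap<>();
        for (String s : servers) load.put(s, 0);

        for (String shard : activeShards) {
            Set<String> holders = new LinkedHashSet<>();
            // 1. Reuse existing replicas first: no copying needed.
            for (String s : current.getOrDefault(shard, Collections.emptySet())) {
                if (holders.size() < replication && load.containsKey(s)) {
                    holders.add(s);
                }
            }
            // 2. Fill the remaining replicas on the least-loaded servers.
            while (holders.size() < replication) {
                String best = null;
                for (String s : servers) {
                    if (holders.contains(s)) continue;
                    if (best == null || load.get(s) < load.get(best)) best = s;
                }
                if (best == null) break; // fewer servers than requested replicas
                holders.add(best);
            }
            for (String s : holders) load.merge(s, 1, Integer::sum);
            next.put(shard, holders);
        }
        return next;
    }
}
```

Obsolete shards would then be whatever a server holds that no longer appears in the new placement; the deployment sub-components could diff the two maps and delete locally.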

The query-integrator component (a.k.a. NutchBean, or the search front-end) would have to be modified. First, we need to review and apply the patch in NUTCH-92 to fix the scoring issues. Second, we would need to make the distributed searching more robust, so that the front-end knows about multiple replicas of each active shard and can route queries to the respective shard servers based on their current load and availability.
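
The routing part of that could be as simple as the following sketch: the front-end keeps a table of live replicas and their reported loads (fed by the shard manager), and picks the least-loaded live replica per query. The class and method names are assumptions for illustration, not part of NutchBean.

```java
import java.util.*;

// Hypothetical replica-aware routing for the search front-end: pick the
// live replica with the lowest reported load; servers with no report are
// treated as down. Illustrative only, not an existing Nutch API.
public class ReplicaRouter {
    /** server -> current load; servers absent from the map are considered down. */
    private final Map<String, Integer> load = new HashMap<>();

    public void report(String server, int currentLoad) {
        load.put(server, currentLoad);
    }

    public void markDown(String server) {
        load.remove(server);
    }

    /** Returns the least-loaded live replica, or null if none is available. */
    public String route(List<String> replicas) {
        String best = null;
        for (String r : replicas) {
            Integer l = load.get(r);
            if (l == null) continue; // replica is down, skip it
            if (best == null || l < load.get(best)) best = r;
        }
        return best;
    }
}
```

A null result would mean the shard has no live replica, which the front-end could surface as a partial-results warning rather than a hard failure.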

This is still just a bunch of loosely-connected ideas I have about the future Nutch architecture ... but sooner or later _something_ has to be done to ease the pain that is the deployment of crawl artifacts, especially in a continuously running operation with incremental indexing ...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
