RE: Architecture for parallel crawling

J B Wed, 01 Jun 2005 13:39:05 -0700

Chirag & ir,

Thank you for your help!

Since threads are handled automatically, it seems that ir's solution of starting up separate instances of nutch crawl is the only way to get things done in parallel.

First question: How do I get multiple instances of "nutch crawl" running at the same time? Do I simply start the same "Nutch"-script repeatedly with different parameters, or do I need completely separate folders, each with a complete Nutch installation, so that each instance has its own urls-file and crawl-urlfilter.txt?

Second question: Starting 20 instances of Nutch will lead to 20 indexes. Can I combine them with "merge" or "mergesegs" without losing any data?


* Ir: I am definitely interested in your ideas, please explain further!

* Chirag: My version of Nutch is 0.6, just downloaded from the project's homepage. I searched the web for patches for Nutch; the best hit was http://issues.apache.org/jira/secure/IssueNavigator.jspa Most of these issues seem to have files attached to them, but I can't tell whether they are all actual patches. Is one of these (Nutch-54?) the patch by Andrzej that you suggested, and if so, how do I install it?


Kind regards,

Jon

