Paolo Castagna wrote:
Crawl.java, lines 84-86:

    if (fs.exists(dir)) {
      throw new RuntimeException(dir + " already exists.");
    }

What are the side effects of removing those lines from Crawl.java?



Hi,

These lines were put there to warn users off running the crawl tool several times on the same directory. Doing so would cause incremental updates to the crawldb, not only at the generate stage (which is a natural occurrence anyway) but also at the inject stage, i.e. the tool could inject the same URLs many times into the same crawldb.

Again, this isn't a strange thing to do, with one small exception ;) In previous releases of Nutch there was a bug in the injector that reset the status of URLs in the crawldb if the same URL was injected again. This bug has now been fixed, so I think it's safe to remove these lines from Crawl.java.
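
To make the difference concrete, here is a rough sketch of the two behaviours. It is not the Nutch injector code: the crawldb is modelled as a plain in-memory map, and the class name, method names and the "fetched" / "unfetched" status strings are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: the crawldb is just a map of url -> status string.
public class InjectSketch {

    // Old, buggy behaviour: every injected url overwrites the stored status,
    // so a url that was already fetched gets reset.
    static void injectResetting(Map<String, String> crawlDb, List<String> seeds) {
        for (String url : seeds) {
            crawlDb.put(url, "unfetched");
        }
    }

    // Fixed behaviour: a url that is already in the db keeps its status,
    // only urls that are really new get added.
    static void injectPreserving(Map<String, String> crawlDb, List<String> seeds) {
        for (String url : seeds) {
            crawlDb.putIfAbsent(url, "unfetched");
        }
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new HashMap<>();
        crawlDb.put("http://example.com/", "fetched");

        // Second run of the crawl tool on the same directory: same seed again, plus a new one.
        List<String> seeds = Arrays.asList("http://example.com/", "http://example.org/");
        injectPreserving(crawlDb, seeds);

        System.out.println(crawlDb.get("http://example.com/")); // fetched
        System.out.println(crawlDb.get("http://example.org/")); // unfetched
    }
}
```

With the preserving variant, running inject repeatedly on the same db only ever adds new entries, which is why the guard in Crawl.java is no longer needed for this reason.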


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
