RE: individual crawl-urlfilter.txt and nutch-site.xml for different crawls?

Joe Malcolm Mon, 30 Jun 2008 12:46:39 -0700

It may be worth keeping in mind that Nutch applies regex-urlfilter.txt
when it runs the parsing plugins, i.e. at the parsing stage immediately
post-crawl. Any links it filters out there never make it into the
segment data, and therefore will never make it into the crawldb. I do
not know whether crawl-urlfilter.txt is handled similarly.
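
To illustrate the effect, here is a minimal Python sketch of
first-match-wins URL filtering in the regex-urlfilter.txt style ('+'
accepts, '-' rejects, and a URL that matches no rule is dropped). It is
not Nutch's Java code; the rules, the URLs and the function names
(load_rules, accepts, filter_outlinks) are made up for the example.

```python
import re

# Hypothetical rules in the regex-urlfilter.txt style (assumed for illustration).
RULE_LINES = [
    "# skip image links",
    r"-\.(gif|jpg|png)$",
    "# accept anything on example.org",
    r"+^http://([a-z0-9]*\.)*example\.org/",
]


def load_rules(lines):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def filter_outlinks(rules, outlinks):
    """Keep only the outlinks that pass the filter, as a parse step would."""
    return [url for url in outlinks if accepts(rules, url)]


RULES = load_rules(RULE_LINES)

# Hypothetical outlinks extracted from a fetched page.
outlinks = [
    "http://www.example.org/page.html",
    "http://www.example.org/logo.png",
    "http://other.example.com/",
]

# Only the surviving links would be written out; the rest are gone for good.
kept = filter_outlinks(RULES, outlinks)
print(kept)
```

In this sketch only the first URL survives, so the other two would never
be recorded anywhere downstream, which is the situation described above
for the segment data and the crawldb.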

Joe
