It may be worth keeping in mind that Nutch runs the parsing plugins,
and therefore applies regex-urlfilter.txt, at the parsing stage,
immediately post-crawl. Any links that it filters out at that point
never make it into the segment data, and so will never make it into
the crawldb. I do not know whether crawl-urlfilter.txt is handled
the same way.
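
To make the effect concrete, here is a rough sketch (not Nutch's actual code) of how a rule file in the style of regex-urlfilter.txt could drop outlinks before they are stored. The rules shown, the function names, and the example.com URLs are all made up for illustration. It assumes the usual convention that each rule is a "+" or "-" followed by a regex, that the first matching rule decides, and that a URL matching no rule is dropped. Python's regex dialect also differs from Java's in places, so treat it only as an approximation.

```python
import re

# Hypothetical rules written in the style of regex-urlfilter.txt.
# Assumption: '+' accepts, '-' rejects, '#' starts a comment line.
RULES = r"""
# skip some binary and static file types
-\.(gif|jpg|png|css|zip)$
# skip URLs containing characters typical of queries
-[?*!@=]
# accept anything else
+.
"""


def load_rules(text):
    """Parse rule lines into (accept, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def filter_outlinks(rules, outlinks):
    """Keep only the outlinks that pass the filter. Anything dropped here
    is simply absent from whatever gets stored downstream."""
    return [url for url in outlinks if accepts(rules, url)]


if __name__ == "__main__":
    rules = load_rules(RULES)
    found = [
        "http://example.com/page.html",
        "http://example.com/logo.png",
        "http://example.com/search?q=nutch",
    ]
    for url in filter_outlinks(rules, found):
        print(url)
```

The point of the sketch is only the ordering: because the filter runs before anything is written out, loosening the rules later will not bring back links that were rejected in an earlier parse; those pages would have to be parsed again.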

Joe
