Luis Lopez created NUTCH-2034:
---------------------------------
Summary: CrawlDB filtered documents counter.
Key: NUTCH-2034
URL: https://issues.apache.org/jira/browse/NUTCH-2034
Project: Nutch
Issue Type: Improvement
Components: crawldb
Affects Versions: 1.10
Reporter: Luis Lopez
Priority: Minor
Fix For: 1.11
When we are doing big crawls, we would like to know how many of the URLs are
being discarded by the regex filters. Currently this is only reported by the
Injector class:
Injector: Total number of urls rejected by filters: 0
It would be nice to have a counter in the CrawlDb class so that we know, in
every round, how many URLs were discarded by our filters:
CrawlDb update: Total number of URLs filtered by regex filters: 31415
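A minimal sketch of the idea, in plain Java with no Nutch or Hadoop dependency. The class name, the exclusion patterns, and the sample URLs are all hypothetical; the only convention borrowed from Nutch is that a URL filter returns null for a rejected URL. In an actual patch the count would presumably be kept in a Hadoop job counter rather than in a field.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the CrawlDb update step; this is not Nutch's code.
class FilteredUrlCounterSketch {

    // Assumed exclusion rules for illustration only.
    static final List<Pattern> EXCLUDES = Arrays.asList(
        Pattern.compile("\\.(gif|jpg|png|css|js)$", Pattern.CASE_INSENSITIVE),
        Pattern.compile("[?*!@=]"));

    // Number of URLs rejected so far.
    private long filtered = 0;

    // Returns the URL if it is accepted, or null if any exclusion rule matches.
    String filter(String url) {
        for (Pattern p : EXCLUDES) {
            if (p.matcher(url).find()) {
                filtered++;
                return null;
            }
        }
        return url;
    }

    long getFiltered() {
        return filtered;
    }

    // Runs every URL through the filter and returns how many were rejected.
    static long countFiltered(List<String> urls) {
        FilteredUrlCounterSketch sketch = new FilteredUrlCounterSketch();
        for (String url : urls) {
            sketch.filter(url);
        }
        return sketch.getFiltered();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.com/",
            "http://example.com/logo.png",
            "http://example.com/search?q=1",
            "http://example.com/about.html");
        System.out.println(
            "CrawlDb update: Total number of URLs filtered by regex filters: "
                + countFiltered(urls));
    }
}
```

The counter is incremented at the single point where a URL is rejected, so the reported total cannot drift from the filter's actual decisions.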