Luis Lopez created NUTCH-2034:
---------------------------------

             Summary: CrawlDB filtered documents counter.
                 Key: NUTCH-2034
                 URL: https://issues.apache.org/jira/browse/NUTCH-2034
             Project: Nutch
          Issue Type: Improvement
          Components: crawldb
    Affects Versions: 1.10
            Reporter: Luis Lopez
            Priority: Minor
             Fix For: 1.11


When running big crawls, we would like to know how many URLs are being 
discarded by the regex filters. Currently this is only reported by the 
Injector class:

Injector: Total number of urls rejected by filters: 0

It would be nice to have a counter in the CrawlDb class so that in every 
round we know how many URLs were discarded by our filters:

CrawlDb update: Total number of URLs filtered by regex filters: 31415
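
A minimal sketch of the idea in plain Java, not the actual Nutch code: the class name, the exclusion patterns, and the sample URLs are all hypothetical. It only assumes the usual URL filter convention that a rejected URL comes back as null, and counts those rejections so the total can be logged at the end of a round.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the filtering step of a CrawlDb update.
public class FilteredUrlCounterSketch {
    // Assumed exclusion rules, in the spirit of regex-urlfilter.txt entries.
    private final List<Pattern> excludePatterns;
    // Number of URLs rejected so far.
    private long filteredCount = 0;

    public FilteredUrlCounterSketch(List<Pattern> excludePatterns) {
        this.excludePatterns = excludePatterns;
    }

    // Returns the URL if accepted, or null if any exclusion rule matches.
    public String filter(String url) {
        for (Pattern p : excludePatterns) {
            if (p.matcher(url).find()) {
                filteredCount++; // count the rejection before dropping the URL
                return null;
            }
        }
        return url;
    }

    public long getFilteredCount() {
        return filteredCount;
    }

    public static void main(String[] args) {
        FilteredUrlCounterSketch counter = new FilteredUrlCounterSketch(Arrays.asList(
                Pattern.compile("\\.(gif|jpg|png)$"), // skip image suffixes
                Pattern.compile("[?*!@=]")));         // skip URLs with query-like characters

        List<String> urls = Arrays.asList(
                "http://example.com/",
                "http://example.com/logo.png",
                "http://example.com/page?id=1",
                "http://example.com/about.html");

        for (String url : urls) {
            counter.filter(url);
        }

        // Prints the proposed log line with a count of 2 for the sample URLs.
        System.out.println("CrawlDb update: Total number of URLs filtered by regex filters: "
                + counter.getFilteredCount());
    }
}
```

In a real MapReduce job the plain field would presumably be replaced by a Hadoop job counter, so that the totals from all map tasks are summed before being logged.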





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
