Yossi Tamari created NUTCH-2463:
-----------------------------------

             Summary: Enable sampling CrawlDB
                 Key: NUTCH-2463
                 URL: https://issues.apache.org/jira/browse/NUTCH-2463
             Project: Nutch
          Issue Type: Improvement
          Components: crawldb
            Reporter: Yossi Tamari
            Priority: Minor


CrawlDB can grow to contain billions of records. At that size *readdb -dump* 
is of little practical use, and *readdb -topN* can take a very long time to run 
(and does not provide a statistically correct sample).
We should add a *-sample* parameter to *readdb -dump*, followed by a number 
between 0 and 1, so that only that fraction of the records in the CrawlDB is 
processed.
The sample should be statistically random, and all other filters should be 
applied to the sampled records.
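
One possible reading of the proposal is independent per-record (Bernoulli) sampling, with the existing filters run only on the records that survive the sample. The following is a minimal sketch of that idea, not Nutch code: the class name `CrawlDbSampleSketch`, the method `sampleThenFilter`, and the use of plain URL strings in place of CrawlDB entries are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Nutch.
final class CrawlDbSampleSketch {

  /**
   * Keeps each record independently with probability {@code fraction},
   * then applies {@code filter} to the sampled records only.
   */
  static <T> List<T> sampleThenFilter(Iterable<T> records, double fraction,
                                      Predicate<T> filter, Random rng) {
    if (fraction < 0.0 || fraction > 1.0) {
      throw new IllegalArgumentException("fraction must be between 0 and 1");
    }
    List<T> kept = new ArrayList<>();
    for (T record : records) {
      // nextDouble() is uniform in [0, 1), so 0 keeps nothing and 1 keeps everything.
      if (rng.nextDouble() < fraction && filter.test(record)) {
        kept.add(record);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Stand-in for CrawlDB records: plain URL strings.
    List<String> urls = new ArrayList<>();
    for (int i = 0; i < 100_000; i++) {
      urls.add("http://example.com/page" + i);
    }
    // Sample roughly 1% of the records, with no further filtering.
    List<String> sampled = sampleThenFilter(urls, 0.01, u -> true, new Random());
    System.out.println("Sampled " + sampled.size() + " of " + urls.size() + " records");
  }
}
```

Because each record is kept or dropped independently, the output size is only approximately the requested fraction of the input. The sampling test is evaluated before the filter, so records that are not sampled never reach the filter at all.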



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)