CrawlDBScanner
--------------

                 Key: NUTCH-784
                 URL: https://issues.apache.org/jira/browse/NUTCH-784
             Project: Nutch
          Issue Type: New Feature
            Reporter: Julien Nioche
            Assignee: Julien Nioche
         Attachments: NUTCH-784.patch
The patch file contains a utility which dumps all the crawldb entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, as the output can be extremely large, and the -url function does not help if we don't know which URL we want to look at. The CrawlDBScanner can either generate a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB.

Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] <-text>

  regex:      regular expression on the crawldb key
  -s status:  constraint on the status of the crawldb entries, e.g. db_fetched, db_unfetched
  -text:      if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat

For instance, the command below:

  ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text

will generate a text file /tmp/amazon-dump containing all the entries of the crawldb matching the regexp .+amazon.com.* and having a status of db_fetched.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
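To illustrate the idea behind the filtering step, here is a minimal sketch (a hypothetical helper, not the actual code in NUTCH-784.patch) of the predicate the scanner applies to each crawldb entry: the URL key must match the regex, and, when -s is given, the entry's status must also match:

```java
import java.util.regex.Pattern;

/**
 * Sketch of the CrawlDBScanner filtering idea (hypothetical helper,
 * not the patch's actual implementation): an entry is kept only if
 * its URL key matches the regex and, optionally, its status matches
 * the one requested via -s.
 */
public class CrawlDbFilterSketch {

    static boolean accept(String url, String status,
                          Pattern regex, String wantedStatus) {
        // URL key must match the regular expression
        if (!regex.matcher(url).matches()) {
            return false;
        }
        // status constraint is optional (-s flag)
        return wantedStatus == null || wantedStatus.equals(status);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(".+amazon.com.*");
        // matches the regex and has the requested status
        System.out.println(accept("http://www.amazon.com/books",
                                  "db_fetched", p, "db_fetched"));
        // right status, but the URL does not match the regex
        System.out.println(accept("http://www.example.org/",
                                  "db_fetched", p, "db_fetched"));
    }
}
```

In the real job this predicate would run inside a map task over the crawldb, with the accepted entries written either through TextOutputFormat or MapFileOutputFormat depending on the -text flag.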