CrawlDBScanner
--------------

                 Key: NUTCH-784
                 URL: https://issues.apache.org/jira/browse/NUTCH-784
             Project: Nutch
          Issue Type: New Feature
            Reporter: Julien Nioche
            Assignee: Julien Nioche
         Attachments: NUTCH-784.patch
The patch file contains a utility which dumps all the crawldb entries whose URL matches a regular expression. The dump mechanism of the crawldb reader is not very useful on large crawldbs, as the output can be extremely large, and the -url function does not help if we don't know which URL we want to look at. The CrawlDBScanner can either generate a text representation of the CrawlDatum-s or binary objects which can then be used as a new CrawlDB.

Usage: CrawlDBScanner <crawldb> <output> <regex> [-s <status>] <-text>

  regex:      regular expression on the crawldb key
  -s status:  constraint on the status of the crawldb entries, e.g. db_fetched, db_unfetched
  -text:      if this parameter is used, the output will be of TextOutputFormat; otherwise it generates a 'normal' crawldb with the MapFileOutputFormat

For instance, the command below:

  ./nutch com.ant.CrawlDBScanner crawl/crawldb /tmp/amazon-dump .+amazon.com.* -s db_fetched -text

will generate a text file /tmp/amazon-dump containing all the entries of the crawldb matching the regexp .+amazon.com.* and having a status of db_fetched.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
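To illustrate the idea behind the filtering step, here is a minimal sketch (a hypothetical helper, not the actual code in NUTCH-784.patch) of the predicate the scanner applies to each crawldb entry: the URL key must match the regex, and, when -s is given, the entry's status must also match:

```java
import java.util.regex.Pattern;

/**
 * Sketch of the CrawlDBScanner filtering idea (hypothetical helper,
 * not the patch's actual implementation): an entry is kept only if
 * its URL key matches the regex and, optionally, its status matches
 * the one requested via -s.
 */
public class CrawlDbFilterSketch {

    static boolean accept(String url, String status,
                          Pattern regex, String wantedStatus) {
        // URL key must match the regular expression
        if (!regex.matcher(url).matches()) {
            return false;
        }
        // status constraint is optional (-s flag)
        return wantedStatus == null || wantedStatus.equals(status);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(".+amazon.com.*");
        // matches the regex and has the requested status
        System.out.println(accept("http://www.amazon.com/books",
                                  "db_fetched", p, "db_fetched"));
        // right status, but the URL does not match the regex
        System.out.println(accept("http://www.example.org/",
                                  "db_fetched", p, "db_fetched"));
    }
}
```

In the real job this predicate would run inside a map task over the crawldb, with the accepted entries written either through TextOutputFormat or MapFileOutputFormat depending on the -text flag.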