CrawlDbReader: add some new stats + dump into a csv format
----------------------------------------------------------
Key: NUTCH-528
URL: https://issues.apache.org/jira/browse/NUTCH-528
Project: Nutch
Issue Type: Improvement
Environment: Java 1.6, Linux 2.6
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
Priority: Minor
Fix For: 1.0.0
* I've improved the stats to list the number of URLs by status and by
host. This is optional.
For instance, if you set the sortByHost option, it will show:
bin/nutch readdb crawl/crawldb -stats sortByHost
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 36
retry 0: 36
min score: 0.0020
avg score: 0.059
max score: 1.0
status 1 (db_unfetched): 33
www.yahoo.com : 33
status 2 (db_fetched): 3
www.yahoo.com : 3
CrawlDb statistics: done
Of course without this option the stats are unchanged.
* I've added a new option to dump the crawldb in CSV format. It is then
easy to import the file into Excel and compute more complex statistics.
bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
Extract of the file:
Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
"http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
"http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
"http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
* I've removed some unused code (CrawlDbDumpReducer) as confirmed by Andrzej.
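
As an illustration of the kind of statistics the CSV dump makes possible, here is a minimal sketch that reads dump lines and counts URLs per status name and host, similar to what -stats sortByHost reports. It is not part of the patch: the class CrawlDbCsvStats and its methods are hypothetical names, and it assumes fields are separated by ';', that double quotes only wrap a field (no escaped quotes inside a value), and that every Url value is a valid URI. It reads only the first three columns, so it does not depend on how the remaining columns line up with the header.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch or of this patch.
public class CrawlDbCsvStats {

    /** Splits one dump line on ';'. Semicolons inside double quotes are kept as data. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;          // drop the quote and toggle the state
            } else if (c == ';' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());        // last field has no trailing ';'
        return fields;
    }

    /** Returns status name -> host -> number of URLs. */
    static Map<String, Map<String, Integer>> countByStatusAndHost(List<String> lines) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            List<String> f = splitLine(line);
            // Skip the header row and anything too short to hold url, code and name.
            if (f.size() < 3 || f.get(0).equals("Url")) {
                continue;
            }
            // Assumption: the Url column is a valid URI (URI.create throws otherwise).
            String host = URI.create(f.get(0)).getHost();
            if (host == null) {
                host = "(unknown)";
            }
            counts.computeIfAbsent(f.get(2), k -> new TreeMap<>())
                  .merge(host, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Header plus the three sample rows from the extract above.
        List<String> lines = List.of(
            "Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;"
                + "Retry interval;Score;Signature;Metadata",
            "\"http://www.yahoo.com/\";1;\"db_unfetched\";Wed Jul 25 14:59:59 CST 2007;"
                + "Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;\"null\";\"null\"",
            "\"http://www.yahoo.com/help.html\";1;\"db_unfetched\";Wed Jul 25 15:08:09 CST 2007;"
                + "Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;\"null\";\"null\"",
            "\"http://www.yahoo.com/contacts.html\";1;\"db_unfetched\";Wed Jul 25 15:08:12 CST 2007;"
                + "Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;\"null\";\"null\"");
        System.out.println(countByStatusAndHost(lines));
        // prints {db_unfetched={www.yahoo.com=3}}
    }
}
```

For a real dump, the lines would come from the file written under FOLDER, for example via java.nio.file.Files.readAllLines, and each record would need to be on a single line.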