[
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516366
]
Doğacan Güney commented on NUTCH-528:
-------------------------------------
This is a personal nit, but the CLI options look a bit awkward. Why not something like:
bin/nutch readdb crawl/crawldb -stats -sortByHost
or, even better (if we can also sort by something else):
bin/nutch readdb crawl/crawldb -stats -sort host|foo|bar
The same goes for the CSV output:
bin/nutch readdb crawl/crawldb -dump FOLDER -format csv|normal
(with the default being 'normal', i.e. the old output format)
These nits aside, I think the changes are useful and I would like to see them go in.
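To make the suggestion concrete, here is a rough sketch of how such flags could be parsed; the class, field names and flag handling are purely illustrative, not the patch's actual code:

public class ReadDbOptions {
  String crawlDb;
  boolean stats;
  boolean dump;
  String dumpDir;
  String sortBy;               // e.g. "host"; null means no sorting
  String format = "normal";    // "normal" keeps the old output, "csv" the new one

  static ReadDbOptions parse(String[] args) {
    ReadDbOptions o = new ReadDbOptions();
    o.crawlDb = args[0];
    for (int i = 1; i < args.length; i++) {
      if ("-stats".equals(args[i])) o.stats = true;
      else if ("-dump".equals(args[i])) { o.dump = true; o.dumpDir = args[++i]; }
      else if ("-sort".equals(args[i])) o.sortBy = args[++i];      // host | foo | bar
      else if ("-format".equals(args[i])) o.format = args[++i];    // csv | normal
      else throw new IllegalArgumentException("Unknown option: " + args[i]);
    }
    return o;
  }
}

That way "-stats -sort host" and "-dump FOLDER -format csv" both fall out of the same option loop.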
> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
> Key: NUTCH-528
> URL: https://issues.apache.org/jira/browse/NUTCH-528
> Project: Nutch
> Issue Type: Improvement
> Environment: Java 1.6, Linux 2.6
> Reporter: Emmanuel Joke
> Assignee: Emmanuel Joke
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-528.patch
>
>
> * I've improved the stats to list the number of URLs by status and by
> host. This is an optional feature.
> For instance, if you set the sortByHost option, it will show:
> bin/nutch readdb crawl/crawldb -stats sortByHost
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls: 36
> retry 0: 36
> min score: 0.0020
> avg score: 0.059
> max score: 1.0
> status 1 (db_unfetched): 33
> www.yahoo.com : 33
> status 2 (db_fetched): 3
> www.yahoo.com : 3
> CrawlDb statistics: done
> Of course, without this option the stats output is unchanged.
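> For illustration only, the grouping idea behind the per-host counts could look
> like the sketch below (plain Java standing in for the real MapReduce job over
> the CrawlDb; names and sample data are made up, the actual patch may differ):
>
> import java.net.URL;
> import java.util.HashMap;
> import java.util.Map;
>
> public class HostStatsSketch {
>   public static void main(String[] args) throws Exception {
>     // Hypothetical (url, status name) pairs standing in for CrawlDb records.
>     String[][] entries = {
>       { "http://www.yahoo.com/", "db_unfetched" },
>       { "http://www.yahoo.com/help.html", "db_unfetched" },
>       { "http://www.yahoo.com/index.html", "db_fetched" },
>     };
>     // Count URLs per status, broken down by host.
>     Map<String, Map<String, Integer>> byStatus = new HashMap<String, Map<String, Integer>>();
>     for (String[] e : entries) {
>       String host = new URL(e[0]).getHost();
>       Map<String, Integer> hosts = byStatus.get(e[1]);
>       if (hosts == null) byStatus.put(e[1], hosts = new HashMap<String, Integer>());
>       Integer n = hosts.get(host);
>       hosts.put(host, n == null ? 1 : n + 1);
>     }
>     // Prints e.g. {db_fetched={www.yahoo.com=1}, db_unfetched={www.yahoo.com=2}}
>     System.out.println(byStatus);
>   }
> }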
> * I've added a new option to dump the crawldb in CSV format. The resulting
> file can then easily be imported into Excel for more complex statistics.
> bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
> Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
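> For reference, a minimal sketch of how one entry could be rendered into this
> semicolon-separated format; the field values are hard-coded here purely for
> illustration, and the patch's actual code may differ:
>
> import java.util.Date;
>
> public class CsvLineSketch {
>   // Quote a string field, writing "null" when the value is absent.
>   static String quote(String s) {
>     return "\"" + (s == null ? "null" : s.replace("\"", "\"\"")) + "\"";
>   }
>
>   public static void main(String[] args) {
>     String url = "http://www.yahoo.com/";
>     int status = 1;
>     String statusName = "db_unfetched";
>     Date fetchTime = new Date();          // would come from the CrawlDatum
>     Date modifiedTime = new Date(0L);
>     int retries = 0;
>     float retryInterval = 2592000.0f;
>     float score = 0.04151206f;
>     String signature = null;
>     String metadata = null;
>
>     String line = quote(url) + ";" + status + ";" + quote(statusName) + ";"
>         + fetchTime + ";" + modifiedTime + ";" + retries + ";"
>         + retryInterval + ";" + score + ";" + quote(signature) + ";" + quote(metadata);
>     System.out.println(line);
>   }
> }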
> * I've removed some unused code (CrawlDbDumpReducer), as confirmed by Andrzej.