[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554560 ]

Andrzej Bialecki  commented on NUTCH-528:
-----------------------------------------

Thanks for a gentle reminder :) After reviewing the patch v2 I have several 
comments:

* CrawlDatum.getMetaData().toString() can easily break the CSV format: it is 
enough for a key or value to contain a literal double quote or semicolon, not 
to mention line breaks. Either you ignore the metadata, or you need to pass 
this string through a method that escapes the special characters that could 
break the format.
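Something along these lines would do - just a sketch, the class and method names are made up here and are not part of the patch. It doubles inner quotes and wraps every field in quotes, so semicolons and line breaks inside a value cannot break the record:

```java
// Hypothetical helper (not existing Nutch code) showing one way to escape
// a metadata string before emitting it as a CSV field. Quoting follows the
// usual convention: double any literal quote, then wrap the whole field in
// quotes so embedded semicolons and line breaks stay inside the field.
public class CsvEscape {

  public static String escapeCsv(String value) {
    if (value == null) return "\"null\"";
    // Double literal quotes, then quote the whole field.
    return "\"" + value.replace("\"", "\"\"") + "\"";
  }

  public static void main(String[] args) {
    System.out.println(escapeCsv("plain"));
    System.out.println(escapeCsv("has \"quotes\"; and ; separators"));
  }
}
```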

* CrawlDbReader.stats.sort: this property name doesn't follow the de facto 
convention that we try to keep when adding new property names. I suggest 
db.reader.stats.sort, and it should be added in the appropriate section of 
nutch-default.xml.
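The nutch-default.xml entry could look something like this (the property name is my suggestion above; the default value and description text are only a sketch):

```xml
<!-- Suggested entry for nutch-default.xml; value and description are
     a sketch, not committed code. -->
<property>
  <name>db.reader.stats.sort</name>
  <value>false</value>
  <description>If true, the CrawlDb -stats command also breaks down
  the URL counts per host for each status.</description>
</property>
```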

* I think that processDumpJob should not accept a String format and parse it 
internally. In my opinion parsing should be the caller's responsibility, and 
the argument here should be an int constant.
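I.e. something like the following - again just a sketch with made-up constant and method names, not actual patch code. The command-line layer maps the user-supplied string to a constant, and processDumpJob only ever sees the int:

```java
// Hypothetical sketch of caller-side format parsing; the constants and
// parseFormat() are illustrations, not existing CrawlDbReader API.
public class DumpFormat {

  public static final int STD_FORMAT = 0;
  public static final int CSV_FORMAT = 1;

  // Parsing lives with the caller; unknown strings fail loudly instead
  // of being silently ignored, which also addresses the bad-argument
  // warning issue below.
  public static int parseFormat(String format) {
    if ("normal".equals(format)) return STD_FORMAT;
    if ("csv".equals(format)) return CSV_FORMAT;
    throw new IllegalArgumentException("Unknown dump format: " + format);
  }
}
```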

* the section that parses input arguments should warn about bad arguments, 
but this patch removes that warning.

* a minor issue: the patch uses inconsistent whitespace (e.g. {{if(sort){}}, 
{{if(st.length >2 )}}, or {{ format = args[i=i+2];}}), this should be fixed so 
that it follows the coding convention.

> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
>                 Key: NUTCH-528
>                 URL: https://issues.apache.org/jira/browse/NUTCH-528
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-528.patch, NUTCH-528_v2.patch
>
>
> * I've improved the stats to list the number of URLs by status and by 
> host. This option is not mandatory.
> For instance if you set sortByHost option, it will show:
> bin/nutch readdb crawl/crawldb -stats sortByHost
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:   36
> retry 0:      36
> min score:    0.0020
> avg score:    0.059
> max score:    1.0
> status 1 (db_unfetched):      33
>    www.yahoo.com :    33
> status 2 (db_fetched):        3
>    www.yahoo.com :    3
> CrawlDb statistics: done
> Of course without this option the stats are unchanged.
> * I've added a new option to dump the crawldb into a CSV format. It will 
> then be easy to load the file into Excel and compute more complex statistics.
> bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
> Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since 
> fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 
> 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 
> 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 
> 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> * I've removed some unused code (CrawlDbDumpReducer) as confirmed by 
> Andrzej.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
