[ 
https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emmanuel Joke updated NUTCH-528:
--------------------------------

    Attachment: NUTCH-528_v3.patch

New patch provided, following Andrzej's recommendations:

??* CrawlDatum.getMetaData().toString() can easily break the CSV format, it's
enough that some of the keys or values contain literal double quotes or
semicolons, not to mention line breaks. Either you ignore the metadata, or you
need to pass this string through a method that will escape special characters
that could break the format.??
==> I've removed the metadata. I don't think it's really important in this
format.
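For reference, the escaping Andrzej describes could be sketched as below: wrap each field in double quotes and double any embedded quotes (RFC 4180 style), so that semicolons, quotes, and line breaks cannot break a row. The class and method names here are illustrative, not part of the patch.

```java
public class CsvEscape {

    // Hypothetical helper: quote a field and escape embedded double quotes
    // so the delimiter (;), quotes, and newlines stay inside the field.
    static String escapeCsv(String field) {
        if (field == null) return "\"null\"";
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // A metadata-like value containing both a quote and a semicolon.
        System.out.println(escapeCsv("key=\"value\";next"));
    }
}
```

With escaping like this the metadata column could be kept safely; dropping it, as done in v3, is simply the more conservative choice.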

??* CrawlDbReader.stats.sort: this property name doesn't follow the de facto
convention that we try to keep when adding new property names. I suggest
db.reader.stats.sort, and it should be added in the appropriate section of
nutch-default.xml??
==> I've also changed the property CrawlDbReader.topN to follow the same
convention. We don't need to add them to the config file; they are just
internal settings populated from the args parameter in the main method.

??* I think that processDumpJob should not accept a String format, and parse it
internally. In my opinion this should be the caller's responsibility, and the
argument here should be an int constant.??
==> You're right. It's done now.
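The shape of that change might look like the following sketch: the caller parses the command-line format string once, and processDumpJob receives an int constant. The constant and method names mirror the discussion but are illustrative, not copied from the patch.

```java
public class DumpFormat {

    // Hypothetical format constants; the caller picks one, not the job.
    public static final int FORMAT_NORMAL = 0;
    public static final int FORMAT_CSV = 1;

    // Caller-side parsing of an argument such as "toCsv".
    static int parseFormat(String arg) {
        return "toCsv".equalsIgnoreCase(arg) ? FORMAT_CSV : FORMAT_NORMAL;
    }

    // The job method only sees the already-parsed constant.
    static void processDumpJob(String crawlDb, String output, int format) {
        System.out.println("dump format = "
            + (format == FORMAT_CSV ? "csv" : "normal"));
    }

    public static void main(String[] args) {
        processDumpJob("crawl/crawldb", "out", parseFormat("toCsv"));
    }
}
```

Keeping string parsing in the caller means processDumpJob never has to reject an unknown format string at job-setup time.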

I took advantage of this new patch to make some modifications to all classes
implementing Hadoop Mapper/Reducer, in order to remove minor errors shown in
Eclipse.

> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
>                 Key: NUTCH-528
>                 URL: https://issues.apache.org/jira/browse/NUTCH-528
>             Project: Nutch
>          Issue Type: Improvement
>         Environment: Java 1.6, Linux 2.6
>            Reporter: Emmanuel Joke
>            Assignee: Emmanuel Joke
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-528.patch, NUTCH-528_v2.patch, NUTCH-528_v3.patch
>
>
> * I've improved the stats to list the number of URLs by status and by 
> host. This option is not mandatory.
> For instance if you set sortByHost option, it will show:
> bin/nutch readdb crawl/crawldb -stats sortByHost
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:   36
> retry 0:      36
> min score:    0.0020
> avg score:    0.059
> max score:    1.0
> status 1 (db_unfetched):      33
>    www.yahoo.com :    33
> status 2 (db_fetched):        3
>    www.yahoo.com :    3
> CrawlDb statistics: done
> Of course without this option the stats are unchanged.
> * I've added a new option to dump the crawldb in CSV format. It will then 
> be easy to load the file into Excel and compute more complex statistics.
> bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
> Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since 
> fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 
> 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 
> 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 
> 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> * I've removed some unused code (CrawlDbDumpReducer), as confirmed by 
> Andrzej.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
