[
https://issues.apache.org/jira/browse/NUTCH-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346558#comment-15346558
]
ASF GitHub Bot commented on NUTCH-2286:
---------------------------------------
GitHub user sebastian-nagel opened a pull request:
https://github.com/apache/nutch/pull/125
NUTCH-2286: CrawlDbReader -stats to show fetch time and interval
Show stats (min, max, average) for fetch time and fetch interval. Improve
TimingUtil when converting times in (milli)seconds to human-readable formats.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sebastian-nagel/nutch CrawlDbStats
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/nutch/pull/125.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #125
----
commit 200d53c113ffdcf2e541029031fd8c4bd814c54d
Author: Sebastian Nagel
Date: 2016-06-20T12:42:04Z
CrawlDb statistics: add fetch time (earliest, latest, average)
commit 209bea43a2a76d765b0f1704066897f26ec9c72d
Author: Sebastian Nagel
Date: 2016-06-22T14:22:33Z
CrawlDb statistics: add fetch interval (shortest, longest, average)
commit d571f52189569bc91d185eb9a6eeb7adcc78ad24
Author: Sebastian Nagel
Date: 2016-06-23T14:32:48Z
CrawlDb statistics: avoid overflow in sum of fetch times for large CrawlDb
----
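The third commit concerns overflow when summing fetch times over a large CrawlDb. As a rough illustration (this is not the code from the pull request): fetch times expressed as epoch milliseconds were about 1.47e12 in mid-2016, so a plain long sum would exceed Long.MAX_VALUE (about 9.2e18) after roughly six million entries. The sketch below shows one possible way to track min, max and an exact average without that limit. The class name FetchTimeStats and its methods are hypothetical, and the input is assumed to be epoch milliseconds.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Nutch: collects min, max and average
// of fetch times (assumed to be epoch milliseconds) without long overflow.
final class FetchTimeStats {
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;
    private long count = 0;
    // BigInteger keeps the running sum exact for any number of entries.
    private BigInteger sum = BigInteger.ZERO;

    void add(long fetchTimeMillis) {
        min = Math.min(min, fetchTimeMillis);
        max = Math.max(max, fetchTimeMillis);
        sum = sum.add(BigInteger.valueOf(fetchTimeMillis));
        count++;
    }

    long min() {
        requireValues();
        return min;
    }

    long max() {
        requireValues();
        return max;
    }

    // The average of long values always lies between min and max,
    // so converting the quotient back to long cannot overflow.
    long average() {
        requireValues();
        return sum.divide(BigInteger.valueOf(count)).longValueExact();
    }

    private void requireValues() {
        if (count == 0) {
            throw new IllegalStateException("no fetch times added");
        }
    }
}
```

BigInteger trades some speed for exactness. Summing in a coarser unit (for example minutes) or keeping a running mean would be alternatives with different precision trade-offs.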
> CrawlDbReader -stats to show fetch time and interval
> ----------------------------------------------------
>
> Key: NUTCH-2286
> URL: https://issues.apache.org/jira/browse/NUTCH-2286
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Affects Versions: 1.12
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.13
>
>
> An overview of fetch times and fetch intervals could be useful when
> configuring a crawl. CrawlDbReader could easily calculate min, max and average
> and show them as part of the statistics job (command-line option {{-stats}}):
> {noformat}
> % bin/nutch readdb .../crawldb/ -stats
> ...
> TOTAL urls: 544910
> shortest fetch interval: 7 days, 00:00:00
> avg fetch interval: 7 days, 17:43:58
> longest fetch interval: 10 days, 12:00:00
> earliest fetch time: Wed May 25 11:42:00 CEST 2016
> avg of fetch times: Sun Jun 05 18:11:00 CEST 2016
> latest fetch time: Wed Jun 22 10:25:00 CEST 2016
> ...
> {noformat}
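The interval lines in the sample output above use a "N days, HH:MM:SS" layout. A minimal sketch of such a conversion follows, assuming the interval is available as a non-negative number of seconds. The class IntervalFormat and its method are hypothetical and are not the TimingUtil API from the pull request.

```java
import java.util.Locale;

// Hypothetical helper, not the Nutch TimingUtil: formats a duration given
// in seconds in the "N days, HH:MM:SS" layout shown in the sample output.
final class IntervalFormat {
    private static final long SECONDS_PER_DAY = 24L * 60 * 60;

    static String format(long totalSeconds) {
        if (totalSeconds < 0) {
            throw new IllegalArgumentException("negative interval: " + totalSeconds);
        }
        long days = totalSeconds / SECONDS_PER_DAY;
        long rest = totalSeconds % SECONDS_PER_DAY;
        long hours = rest / 3600;
        long minutes = (rest % 3600) / 60;
        long seconds = rest % 60;
        // Locale.ROOT keeps the digits independent of the default locale.
        // Simplification: the unit is always written as "days", even for 1.
        return String.format(Locale.ROOT, "%d days, %02d:%02d:%02d",
                days, hours, minutes, seconds);
    }
}
```

For example, an interval of 668638 seconds would be rendered in the same form as the average line in the sample, and 907200 seconds in the same form as the longest interval.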
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)