[
https://issues.apache.org/jira/browse/NUTCH-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15411716#comment-15411716
]
Sebastian Nagel commented on NUTCH-2297:
----------------------------------------
The wrong values are already present in the temporary output of the stats job.
To inspect it:
# comment out {{fileSystem.delete(tmpFolder, true);}} in
CrawlDbReader.processStatJobHelper(...)
# dump data in {{crawldb/stat_tmpXXXX}} via {{hadoop fs -text ...}}
While there is only one value for each of the minima (scn, fin, ftn), there are
multiple values for the totals and maxima:
{noformat}
retry 1 148125397
retry 2 82761892
retry 3 41645830
scn 0
scx 7369
sct 14807601
scx 7110
sct 20791107
scx 8390
sct 13135199
... (scx and sct repeating)
scx 7010
sct 17505486
fin 15120000
fix 1360800
fit 1336710211200
fix 1360800
fit 1180199008800
...
fix 1360800
fit 1319982048000
ftn 597986250
ftx 26821441
ftt 35611037001815
ftx 26821441
...
{noformat}
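Whatever path the partial values take through the job, the final statistics
need exactly one value per key. A minimal sketch of that merge in Python (not
Nutch's actual reducer code), assuming from the key names that the "n" keys are
minima, the "x" keys are maxima and everything else is summed. The name
merge_stats and the two key sets are hypothetical, and the sample pairs are
only the score lines visible in the dump above:

```python
# Keys as they appear in the dump; the min/max grouping is an assumption
# read off the key names, not taken from the Nutch source.
MIN_KEYS = {"scn", "fin", "ftn"}
MAX_KEYS = {"scx", "fix", "ftx"}


def merge_stats(pairs):
    """Collapse (key, value) pairs into a single value per key."""
    merged = {}
    for key, value in pairs:
        if key not in merged:
            merged[key] = value
        elif key in MIN_KEYS:
            merged[key] = min(merged[key], value)
        elif key in MAX_KEYS:
            merged[key] = max(merged[key], value)
        else:
            # totals (sct, fit, ftt) and plain counters are summed
            merged[key] = merged[key] + value
    return merged


# Only the score lines visible in the dump; elided lines are not included.
visible = [
    ("scn", 0),
    ("scx", 7369), ("sct", 14807601),
    ("scx", 7110), ("sct", 20791107),
    ("scx", 8390), ("sct", 13135199),
    ("scx", 7010), ("sct", 17505486),
]
stats = merge_stats(visible)
print(stats)
```

Because min, max and sum are all associative and commutative, this kind of
merge gives the same result no matter how the partial values are grouped.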
The values for "fin" and "ftn" are already wrong at this point (both exceed the
corresponding maxima "fix" and "ftx"):
{noformat}
# 15120000 sec. = 175 days
% echo $((15120000/(60*60*24)))
175
# 597986250 as "epoch minutes":
% date -u --date=@$((597986250*60))
Thu Dec 20 05:30:00 UTC 3106
{noformat}
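The same plausibility check can be scripted. A small sketch, assuming (as the
shell checks above do) that fin and fix are in seconds and ftn is in minutes
since the Unix epoch. The helper name epoch_minutes_to_utc is made up for
illustration:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_minutes_to_utc(minutes):
    """Convert minutes since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(minutes=minutes)


fin, fix = 15120000, 1360800  # values from the dump, assumed to be seconds
ftn = 597986250               # value from the dump, assumed epoch minutes

shortest_days = fin / 86400
longest_days = fix / 86400
earliest = epoch_minutes_to_utc(ftn)

print("shortest interval (days):", shortest_days)
print("longest interval (days):", longest_days)
print("earliest fetch time:", earliest.isoformat())
# A minimum that is larger than the maximum cannot be right.
print("min <= max:", fin <= fix)
```

Adding a timedelta to a fixed epoch avoids platform limits that
datetime.fromtimestamp can hit for dates this far in the future.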
We need to trace what is going wrong in CrawlDbStatMapper, CrawlDbStatCombiner
and CrawlDbStatReducer.
> CrawlDbReader -stats wrong values for earliest fetch time and shortest
> interval
> -------------------------------------------------------------------------------
>
> Key: NUTCH-2297
> URL: https://issues.apache.org/jira/browse/NUTCH-2297
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.13
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.13
>
>
> NUTCH-2286 added min, max and average for fetch interval and fetch time.
> When running in distributed mode (not reproducible in local mode), the values
> for the minimum (earliest fetch time and shortest fetch interval) may be
> wrong with implausible values:
> {noformat}
> TOTAL urls: 7180518032
> shortest fetch interval: 175 days, 00:00:00 <<<<<< ????
> avg fetch interval: 10 days, 08:01:36
> longest fetch interval: 15 days, 18:00:00
> earliest fetch time: Thu Dec 20 05:30:00 UTC 3106 <<<<<< ????
> avg of fetch times: Fri Feb 19 00:07:00 UTC 2016
> latest fetch time: Mon Jul 18 05:22:00 UTC 2016
> retry 0: 6907984913
> retry 1: 148125397
> retry 2: 82761892
> retry 3: 41645830
> min score: 0.0
> avg score: 0.014360981
> max score: 9.25
> ...
> {noformat}