[ https://issues.apache.org/jira/browse/NUTCH-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290954#comment-16290954 ]
Sebastian Nagel commented on NUTCH-2474:
----------------------------------------
Tested at scale in distributed mode:
{noformat}
TOTAL urls: 17390543871
shortest fetch interval: 00:01:00
avg fetch interval: 13 days, 14:07:41
longest fetch interval: 547 days, 12:00:00
earliest fetch time: Wed Jun 24 17:30:00 UTC 2015
avg of fetch times: Wed Jun 21 13:34:00 UTC 2017
latest fetch time: Sun Apr 29 13:19:00 UTC 2018
...
score quantile 0.01: -0.002356610852291141
score quantile 0.05: 0.0
score quantile 0.1: 0.0
score quantile 0.2: 0.0
score quantile 0.25: 0.0
score quantile 0.3: 0.0
score quantile 0.4: 0.0
score quantile 0.5: 9.06446965986782E-29
score quantile 0.6: 4.5256326704019776E-7
score quantile 0.7: 2.2914155437921596E-5
score quantile 0.75: 8.547483341977483E-5
score quantile 0.8: 2.048824943119118E-4
score quantile 0.9: 8.283822892023804E-4
score quantile 0.95: 0.005019234934509838
score quantile 0.99: 0.29275233729514805
min score: -179359.0
avg score: 2.0286281191487183
max score: 6.808600064E9
...
{noformat}
Quantiles are now calculated successfully (and they are much more useful, as
they're less influenced by a few outliers). The min/max of fetch time and fetch
interval also look correct now (see NUTCH-2297).
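
To illustrate why quantiles are more robust than the average here (compare the avg score of about 2.03 with the 0.99 quantile of about 0.29 above), a minimal sketch follows. It uses an exact, sort-based nearest-rank quantile on made-up sample scores. The class name, method names and data are hypothetical, and this is not how CrawlDbReader computes its quantiles.

```java
import java.util.Arrays;

public class QuantileVsMean {

    /** Arithmetic mean of the given values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Nearest-rank quantile: the value at rank ceil(q * n) in sorted order. */
    static double quantile(double[] values, double q) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        // Clamp so that q = 0 and q = 1 map to the first and last element.
        int index = Math.max(0, Math.min(sorted.length - 1, rank - 1));
        return sorted[index];
    }

    public static void main(String[] args) {
        // Made-up scores: nine small values and a single huge outlier.
        double[] scores = {0.0, 0.0, 0.0, 0.001, 0.002, 0.003, 0.005, 0.01, 0.02, 1.0e6};
        System.out.println("mean   = " + mean(scores));
        System.out.println("median = " + quantile(scores, 0.5));
        System.out.println("q0.9   = " + quantile(scores, 0.9));
    }
}
```

The single outlier pulls the mean far above every other value, while the median and the 0.9 quantile stay within the range of the typical scores.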
> CrawlDbReader -stats fails with ClassCastException
> --------------------------------------------------
>
> Key: NUTCH-2474
> URL: https://issues.apache.org/jira/browse/NUTCH-2474
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 1.14
> Environment: Java 8, distributed mode: Hadoop CDH 5.13.0
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.14
>
>
> In distributed mode CrawlDbReader / readdb -stats fails with a
> ClassCastException in the combiner:
> {noformat}
> 17/12/08 04:57:13 INFO mapreduce.Job: Task Id : attempt_1512553291624_0022_m_000039_0, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.FloatWritable cannot be cast to org.apache.hadoop.io.LongWritable
>     at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:296)
>     at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:222)
>     at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1639)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1946)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1514)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> {noformat}
> FloatWritables have been used since NUTCH-2470, so that is when this bug was
> introduced.
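
The failure mode in the stack trace above can be reproduced in miniature. The sketch below assumes a combiner that receives values of more than one type under a common supertype and casts every one of them to the long type. `Value`, `LongValue` and `FloatValue` are hypothetical stand-ins for the Hadoop Writable classes so that the example runs without Hadoop on the classpath. It is neither the actual CrawlDbStatCombiner code nor the fix that was applied.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerCastSketch {

    // Hypothetical stand-ins for Writable, LongWritable and FloatWritable.
    interface Value {}

    static final class LongValue implements Value {
        final long v;
        LongValue(long v) { this.v = v; }
    }

    static final class FloatValue implements Value {
        final float v;
        FloatValue(float v) { this.v = v; }
    }

    /** Assumes every value is a LongValue; fails as soon as a FloatValue arrives. */
    static long sumUnchecked(List<Value> values) {
        long sum = 0;
        for (Value val : values) {
            sum += ((LongValue) val).v; // ClassCastException for FloatValue
        }
        return sum;
    }

    /** Dispatches on the runtime type instead of casting blindly. */
    static double sumChecked(List<Value> values) {
        double sum = 0.0;
        for (Value val : values) {
            if (val instanceof LongValue) {
                sum += ((LongValue) val).v;
            } else if (val instanceof FloatValue) {
                sum += ((FloatValue) val).v;
            } else {
                throw new IllegalArgumentException("unsupported value type: " + val.getClass());
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Value> mixed = Arrays.<Value>asList(new LongValue(2L), new FloatValue(1.5f));
        System.out.println("checked sum = " + sumChecked(mixed));
        try {
            sumUnchecked(mixed);
        } catch (ClassCastException e) {
            System.out.println("unchecked sum failed: " + e.getClass().getName());
        }
    }
}
```

Checking the runtime type is only one possible way to handle mixed value types; emitting a single value type per key would avoid the problem as well.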