[ https://issues.apache.org/jira/browse/NUTCH-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290954#comment-16290954 ]
Sebastian Nagel commented on NUTCH-2474:
----------------------------------------

Tested at scale in distributed mode:
{noformat}
TOTAL urls: 17390543871
shortest fetch interval: 00:01:00
avg fetch interval: 13 days, 14:07:41
longest fetch interval: 547 days, 12:00:00
earliest fetch time: Wed Jun 24 17:30:00 UTC 2015
avg of fetch times: Wed Jun 21 13:34:00 UTC 2017
latest fetch time: Sun Apr 29 13:19:00 UTC 2018
...
score quantile 0.01: -0.002356610852291141
score quantile 0.05: 0.0
score quantile 0.1: 0.0
score quantile 0.2: 0.0
score quantile 0.25: 0.0
score quantile 0.3: 0.0
score quantile 0.4: 0.0
score quantile 0.5: 9.06446965986782E-29
score quantile 0.6: 4.5256326704019776E-7
score quantile 0.7: 2.2914155437921596E-5
score quantile 0.75: 8.547483341977483E-5
score quantile 0.8: 2.048824943119118E-4
score quantile 0.9: 8.283822892023804E-4
score quantile 0.95: 0.005019234934509838
score quantile 0.99: 0.29275233729514805
min score: -179359.0
avg score: 2.0286281191487183
max score: 6.808600064E9
...
{noformat}
Quantiles are successfully calculated (and are much more useful, as they are less influenced by a few outliers). Also, min/max of fetch time and interval now look correct (see NUTCH-2297).

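To illustrate why quantiles are less influenced by a few outliers than the average, here is a minimal Java sketch. It uses an exact nearest-rank quantile on a small in-memory array, which is only an illustration and is not meant to reflect how CrawlDbReader computes quantiles over a large CrawlDb. The class name QuantileSketch and the sample scores are hypothetical.

```java
import java.util.Arrays;

public class QuantileSketch {

    /** Nearest-rank quantile of an ascending-sorted, non-empty array, for 0 < q <= 1. */
    public static double quantile(double[] sorted, double q) {
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean; NaN for an empty array. */
    public static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical scores: nine small values and one extreme outlier.
        double[] scores = {0.0, 0.0, 0.0, 0.0, 0.001, 0.002, 0.003, 0.004, 0.005, 1000000.0};
        Arrays.sort(scores);

        // The single outlier dominates the average but leaves the median untouched.
        System.out.println("avg score:          " + mean(scores));
        System.out.println("score quantile 0.5: " + quantile(scores, 0.5));
    }
}
```

With one extreme value in the sample, the average is pulled far away from the typical score while the median stays among the small values, which matches the pattern in the stats above (avg score of about 2.03 versus a median close to zero).
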
> CrawlDbReader -stats fails with ClassCastException
> --------------------------------------------------
>
>                 Key: NUTCH-2474
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2474
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.14
>         Environment: Java 8, distributed mode: Hadoop CDH 5.13.0
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.14
>
>
> In distributed mode CrawlDbReader / readdb -stats fails with a ClassCastException in the combiner:
> {noformat}
> 17/12/08 04:57:13 INFO mapreduce.Job: Task Id : attempt_1512553291624_0022_m_000039_0, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.FloatWritable cannot be cast to org.apache.hadoop.io.LongWritable
>         at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:296)
>         at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:222)
>         at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1639)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1946)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1514)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> {noformat}
> FloatWritables are used since NUTCH-2470, so that's when this bug was introduced.
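
The failure pattern in the quoted stack trace (a FloatWritable reaching code that casts every value to LongWritable) can be reproduced in isolation. The sketch below is not the Nutch combiner and not the actual fix; the nested LongWritable and FloatWritable classes are hypothetical stand-ins for the Hadoop types so that the example runs without Hadoop on the classpath, and the method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerCastSketch {

    // Hypothetical stand-ins for org.apache.hadoop.io types (assumption: simplified to one field).
    public interface Writable {}

    public static final class LongWritable implements Writable {
        final long value;
        public LongWritable(long value) { this.value = value; }
    }

    public static final class FloatWritable implements Writable {
        final float value;
        public FloatWritable(float value) { this.value = value; }
    }

    /** Assumes every value is a LongWritable; throws on the first FloatWritable. */
    public static long sumAssumingLongs(List<Writable> values) {
        long sum = 0;
        for (Writable w : values) {
            sum += ((LongWritable) w).value; // ClassCastException if w is a FloatWritable
        }
        return sum;
    }

    /** Checks the runtime type before casting, so mixed value types are handled. */
    public static double sumWithTypeCheck(List<Writable> values) {
        double sum = 0;
        for (Writable w : values) {
            if (w instanceof LongWritable) {
                sum += ((LongWritable) w).value;
            } else if (w instanceof FloatWritable) {
                sum += ((FloatWritable) w).value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Writable> values = Arrays.asList(new LongWritable(2L), new FloatWritable(0.5f));
        try {
            sumAssumingLongs(values);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
        System.out.println("type-checked sum: " + sumWithTypeCheck(values));
    }
}
```

The point of the sketch is only that an unconditional cast breaks as soon as a second value type is introduced, while a runtime type check (or separate code paths per value type) does not.
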