[ 
https://issues.apache.org/jira/browse/NUTCH-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290954#comment-16290954
 ] 

Sebastian Nagel commented on NUTCH-2474:
----------------------------------------

Tested at scale in distributed mode:
{noformat}
TOTAL urls:     17390543871
shortest fetch interval:        00:01:00
avg fetch interval:     13 days, 14:07:41
longest fetch interval: 547 days, 12:00:00
earliest fetch time:    Wed Jun 24 17:30:00 UTC 2015
avg of fetch times:     Wed Jun 21 13:34:00 UTC 2017
latest fetch time:      Sun Apr 29 13:19:00 UTC 2018
...
score quantile 0.01:    -0.002356610852291141
score quantile 0.05:    0.0
score quantile 0.1:     0.0
score quantile 0.2:     0.0
score quantile 0.25:    0.0
score quantile 0.3:     0.0
score quantile 0.4:     0.0
score quantile 0.5:     9.06446965986782E-29
score quantile 0.6:     4.5256326704019776E-7
score quantile 0.7:     2.2914155437921596E-5
score quantile 0.75:    8.547483341977483E-5
score quantile 0.8:     2.048824943119118E-4
score quantile 0.9:     8.283822892023804E-4
score quantile 0.95:    0.005019234934509838
score quantile 0.99:    0.29275233729514805
min score:      -179359.0
avg score:      2.0286281191487183
max score:      6.808600064E9
...
{noformat}
Quantiles are calculated successfully (and are much more useful, as they are less 
influenced by a few outliers). The min/max of fetch time and interval now also look 
correct (see NUTCH-2297).
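
To illustrate why quantiles are more robust than the average here (avg score 2.03 versus a median near zero), the following is a minimal sketch only. The class and method names are hypothetical, and it uses an exact nearest-rank quantile on a sorted in-memory array, which is not necessarily how CrawlDbReader computes quantiles in a distributed job.

```java
import java.util.Arrays;

public class QuantileSketch {
    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Exact nearest-rank quantile for q in (0, 1], computed on a sorted copy. */
    static double quantile(double[] values, double q) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up scores: mostly small values plus one large outlier.
        double[] scores = {0.0, 0.0, 1.0, 2.0, 1000.0};
        System.out.println("mean:   " + mean(scores));
        System.out.println("median: " + quantile(scores, 0.5));
    }
}
```

A single large value dominates the mean, while the median stays at a typical score.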

> CrawlDbReader -stats fails with ClassCastException
> --------------------------------------------------
>
>                 Key: NUTCH-2474
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2474
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.14
>         Environment: Java 8, distributed mode: Hadoop CDH 5.13.0
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.14
>
>
> In distributed mode CrawlDbReader / readdb -stats fails with a 
> ClassCastException in the combiner:
> {noformat}
> 17/12/08 04:57:13 INFO mapreduce.Job: Task Id : attempt_1512553291624_0022_m_000039_0, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.FloatWritable cannot be cast to org.apache.hadoop.io.LongWritable
>         at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:296)
>         at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:222)
>         at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1639)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1946)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1514)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> {noformat}
> FloatWritables have been used since NUTCH-2470, which is when this bug was 
> introduced.
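
The failure mode in the stack trace can be sketched as follows. This is an illustration only, not the Nutch code or the actual fix: the nested classes are hypothetical stand-ins for the Hadoop types named in the trace, and the type check shown is just one assumed way to avoid the blind cast.

```java
public class CombinerCastSketch {
    // Hypothetical stand-ins for the Hadoop Writable types named in the stack trace.
    interface Writable {}

    static final class LongWritable implements Writable {
        final long value;
        LongWritable(long value) { this.value = value; }
    }

    static final class FloatWritable implements Writable {
        final float value;
        FloatWritable(float value) { this.value = value; }
    }

    /** Sums the values assuming every one is a LongWritable. */
    static long sumAssumingLong(Writable[] values) {
        long sum = 0;
        for (Writable v : values) {
            // Throws ClassCastException as soon as v is a FloatWritable.
            sum += ((LongWritable) v).value;
        }
        return sum;
    }

    /** Sums the values after checking the runtime type of each one. */
    static double sumCheckingType(Writable[] values) {
        double sum = 0.0;
        for (Writable v : values) {
            if (v instanceof LongWritable) {
                sum += ((LongWritable) v).value;
            } else if (v instanceof FloatWritable) {
                sum += ((FloatWritable) v).value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Writable[] mixed = { new LongWritable(3), new FloatWritable(0.5f) };
        try {
            sumAssumingLong(mixed);
        } catch (ClassCastException e) {
            System.out.println("blind cast failed: " + e.getMessage());
        }
        System.out.println("type-checked sum: " + sumCheckingType(mixed));
    }
}
```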



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
