[ 
https://issues.apache.org/jira/browse/NUTCH-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911089#comment-17911089
 ] 

Sebastian Nagel commented on NUTCH-3102:
----------------------------------------

Ok, I'm able to reproduce the issue given the serialization bytes of the 
MergingDigest:
{noformat}
$> jshell --class-path build/lib/t-digest-3.3.jar

jshell> import java.nio.ByteBuffer;
jshell> import com.tdunning.math.stats.TDigest;
jshell> import com.tdunning.math.stats.MergingDigest;

jshell> int[] tdigestSer = { 0x00, 0x00, 0x00, 0x02, 0xff, 0xf8, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, 0xff, 0xf8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x42, 
0xc8, 0x00, 0x00, 0x00, 0xd2, 0x04, 0x1a, 0x00, 0x17, 0x42, 0x8e, 0x00, 0x00, 
0xff, 0xc0, 0x00, 0x00, 0x40, 0x40, 0x00, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x42, 
0x14, 0x00, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x42, 0x60, 0x00, 0x00, 0xff, 0xc0, 
0x00, 0x00, 0x42, 0xaa, 0x00, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x43, 0xa2, 0x80, 
0x00, 0x47, 0xaf, 0x57, 0x9b, 0x45, 0x7f, 0xd0, 0x00, 0x4a, 0xcf, 0xc0, 0xdb, 
0x43, 0x7d, 0x00, 0x00, 0x4d, 0xac, 0x61, 0x02, 0x45, 0x72, 0xb0, 0x00, 0x4e, 
0x8d, 0x9d, 0xbd, 0x43, 0x67, 0x00, 0x00, 0x66, 0xe1, 0x9a, 0x9c, 0x45, 0xaa, 
0x70, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x45, 0x72, 0x40, 0x00, 0xff, 0xc0, 0x00, 
0x00, 0x45, 0xd2, 0x88, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x46, 0x10, 0x10, 0x00, 
0xff, 0xc0, 0x00, 0x00, 0x46, 0x1f, 0x7c, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x46, 
0x0b, 0x98, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x46, 0x31, 0xb8, 0x00, 0xff, 0xc0, 
0x00, 0x00, 0x46, 0x12, 0x2c, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x45, 0xcb, 0x40, 
0x00, 0xff, 0xc0, 0x00, 0x00, 0x45, 0x6d, 0xf0, 0x00, 0xff, 0xc0, 0x00, 0x00, 
0x45, 0x78, 0x70, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x45, 0x9e, 0x60, 0x00, 0xff, 
0xc0, 0x00, 0x00, 0x45, 0x94, 0xd8, 0x00, 0xff, 0xc0, 0x00, 0x00, 0x00, 0x00, 
0x00, 0x00, 0x00, 0x00, 0x00, 0x00 };
tdigestSer ==> int[222] { 0, 0, 0, 2, 255, 248, 0, 0, 0, 0, 0, 0 ... , 0, 0, 0, 
0, 0, 0, 0, 0 }

jshell> byte[] tdigestSerBytes = new byte[tdigestSer.length];
tdigestSerBytes ==> byte[222] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ... , 0, 
0, 0, 0, 0, 0, 0, 0 }

jshell> for (int i = 0; i < tdigestSer.length; i++) { tdigestSerBytes[i] = 
(byte) tdigestSer[i];}

jshell> MergingDigest tdig = 
MergingDigest.fromBytes(ByteBuffer.wrap(tdigestSerBytes));
tdig ==> MergingDigest-K_2-weight-alternating-twoLevel

jshell> TDigest.createMergingDigest(100.0).add(tdig)
|  Exception java.lang.IllegalArgumentException: Cannot add NaN to t-digest
|        at MergingDigest.add (MergingDigest.java:256)
|        at MergingDigest.add (MergingDigest.java:246)
|        at AbstractTDigest.add (AbstractTDigest.java:135)
|        at (#43:1)

jshell> tdig.centroids()
$44 ==> [Centroid{centroid=NaN, count=71}, Centroid{centroid=NaN, count=3}, 
Centroid{centroid=NaN, count=37}, Centroid{centroid=NaN, count=56}, 
Centroid{centroid=NaN, count=85}, Centroid{centroid=89775.2109375, count=325}, 
Centroid{centroid=6807661.5, count=4093}, Centroid{centroid=3.61504832E8, 
count=253}, Centroid{centroid=1.187962496E9, count=3883}, 
Centroid{centroid=5.326922491088457E23, count=231}, Centroid{centroid=NaN, 
count=5454}, Centroid{centroid=NaN, count=3876}, Centroid{centroid=NaN, 
count=6737}, Centroid{centroid=NaN, count=9220}, Centroid{centroid=NaN, 
count=10207}, Centroid{centroid=NaN, count=8934}, Centroid{centroid=NaN, 
count=11374}, Centroid{centroid=NaN, count=9355}, Centroid{centroid=NaN, 
count=6504}, Centroid{centroid=NaN, count=3807}, Centroid{centroid=NaN, 
count=3975}, Centroid{centroid=NaN, count=5068}, Centroid{centroid=NaN, 
count=4763}]

{noformat}

Of course, this only reproduces the exception. It neither explains nor 
reproduces why so many centroids are NaN.
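
To see where the NaN values sit in the serialized form, the bytes can also be 
decoded by hand. The following is only a sketch: the field layout (int 
encoding, double min, double max, float compression, three shorts, then float 
weight/mean pairs) is an assumption inferred by lining up the bytes with the 
centroids() output above, and the class and method names are made up for 
illustration. To stay short it decodes only the first 78 bytes, i.e. the 
header plus the first six centroids.

```java
import java.nio.ByteBuffer;

class TDigestBytesDump {

    // First 78 bytes of the value logged in the issue description:
    // an (assumed) 30-byte header followed by 6 (weight, mean) pairs.
    static final String PREFIX_HEX =
        "00 00 00 02 ff f8 00 00 00 00 00 00 ff f8 00 00 00 00 00 00 "
      + "42 c8 00 00 00 d2 04 1a 00 17 "
      + "42 8e 00 00 ff c0 00 00 40 40 00 00 ff c0 00 00 "
      + "42 14 00 00 ff c0 00 00 42 60 00 00 ff c0 00 00 "
      + "42 aa 00 00 ff c0 00 00 43 a2 80 00 47 af 57 9b";

    /** Converts a whitespace-separated hex dump into a byte array. */
    static byte[] parseHex(String hex) {
        String[] parts = hex.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    /**
     * Walks the assumed layout and returns the number of NaN means among
     * the centroids that are completely contained in the given bytes.
     */
    static int countNaNMeans(byte[] bytes, boolean print) {
        ByteBuffer buf = ByteBuffer.wrap(bytes); // big-endian by default
        int encoding = buf.getInt();
        double min = buf.getDouble();
        double max = buf.getDouble();
        float compression = buf.getFloat();
        buf.getShort();            // assumed: size of the main buffer
        buf.getShort();            // assumed: size of the temporary buffer
        int used = buf.getShort(); // assumed: number of centroids in use
        if (print) {
            System.out.printf("encoding=%d min=%s max=%s compression=%s used=%d%n",
                encoding, min, max, compression, used);
        }
        // The input may be truncated, so read only complete 8-byte pairs.
        int available = Math.min(used, buf.remaining() / 8);
        int nan = 0;
        for (int i = 0; i < available; i++) {
            float weight = buf.getFloat();
            float mean = buf.getFloat();
            if (Float.isNaN(mean)) {
                nan++;
            }
            if (print) {
                System.out.println("weight=" + weight + " mean=" + mean);
            }
        }
        return nan;
    }

    public static void main(String[] args) {
        int nan = countNaNMeans(parseHex(PREFIX_HEX), true);
        System.out.println("NaN means in prefix: " + nan);
    }
}
```

Read this way, every ff c0 00 00 is a float NaN in a mean slot, and the two 
ff f8 00 00 00 00 00 00 words at the start are double NaNs, so under the 
assumed layout min and max of the digest are NaN as well.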

> CrawlDbReader -stats fails with Cannot add NaN to t-digest
> ----------------------------------------------------------
>
>                 Key: NUTCH-3102
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3102
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.19
>            Reporter: Marcos Gomez
>            Priority: Major
>             Fix For: 1.21
>
>
> When running in local mode CrawlDbReader / readdb -stats fails with 
> "java.lang.Exception: java.lang.IllegalArgumentException: Cannot add NaN to 
> t-digest"
>  
> {noformat}
> java.lang.Exception: java.lang.IllegalArgumentException: Cannot add NaN to 
> t-digest
>     at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) 
> ~[hadoop-mapreduce-client-common-3.3.4.jar:?]
>     at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:559) 
> ~[hadoop-mapreduce-client-common-3.3.4.jar:?]
> Caused by: java.lang.IllegalArgumentException: Cannot add NaN to t-digest
>     at com.tdunning.math.stats.MergingDigest.add(MergingDigest.java:256) 
> ~[t-digest-3.3.jar:?]
>     at com.tdunning.math.stats.MergingDigest.add(MergingDigest.java:246) 
> ~[t-digest-3.3.jar:?]
>     at com.tdunning.math.stats.AbstractTDigest.add(AbstractTDigest.java:135) 
> ~[t-digest-3.3.jar:?]
>     at 
> org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatReducer.reduce(CrawlDbReader.java:489)
>  ~[apache-nutch-1.19.jar:?]
>     at 
> org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatReducer.reduce(CrawlDbReader.java:422)
>  ~[apache-nutch-1.19.jar:?]
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171) 
> ~[hadoop-mapreduce-client-core-3.3.4.jar:?]
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628) 
> ~[hadoop-mapreduce-client-core-3.3.4.jar:?]
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390) 
> ~[hadoop-mapreduce-client-core-3.3.4.jar:?]
>     at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:347)
>  ~[hadoop-mapreduce-client-common-3.3.4.jar:?]
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) ~[?:?]{noformat}
> I added a log statement to find out why this happens. Apparently the tdig 
> is built from this value of a BytesWritable object:
> {noformat}
> Error adding scd value: 00 00 00 02 ff f8 00 00 00 00 00 00 ff f8 00 00 00 00 
> 00 00 42 c8 00 00 00 d2 04 1a 00 17 42 8e 00 00 ff c0 00 00 40 40 00 00 ff c0 
> 00 00 42 14 00 00 ff c0 00 00 42 60 00 00 ff c0 00 00 42 aa 00 00 ff c0 00 00 
> 43 a2 80 00 47 af 57 9b 45 7f d0 00 4a cf c0 db 43 7d 00 00 4d ac 61 02 45 72 
> b0 00 4e 8d 9d bd 43 67 00 00 66 e1 9a 9c 45 aa 70 00 ff c0 00 00 45 72 40 00 
> ff c0 00 00 45 d2 88 00 ff c0 00 00 46 10 10 00 ff c0 00 00 46 1f 7c 00 ff c0 
> 00 00 46 0b 98 00 ff c0 00 00 46 31 b8 00 ff c0 00 00 46 12 2c 00 ff c0 00 00 
> 45 cb 40 00 ff c0 00 00 45 6d f0 00 ff c0 00 00 45 78 70 00 ff c0 00 00 45 9e 
> 60 00 ff c0 00 00 45 94 d8 00 ff c0 00 00 00 00 00 00 00 00 00 00{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
