leerho commented on issue #514:
URL: 
https://github.com/apache/datasketches-java/issues/514#issuecomment-1970067490

   @drewdahlke,
   What you are experiencing is not a problem with the sketch.  The problem is 
that your test code is corrupting your source stream, which you have modeled as 
an ArrayList<>.  
   
   ### Analysis
   Let me illustrate with some code snippets and test runs and I'll attach the 
whole test class I am using at the end.  I have made a few changes to your code:
   - To make it more easily modifiable, like putting some of your magic numbers 
at the top, which also makes the code more transparent.
   - I have renamed some of your variables to make it a bit more clear what 
their function is. Of these the most important is to distinguish between your 
source list and the merging target sketch.
   - I have reduced the number of sketches in your source list to 10 (from 90) 
as it demonstrates your problem easily and makes printing more reasonable. The 
shuffle is commented out initially.
   - I also moved the declaration of the merge target sketch above the trials 
loop. You will see why in a few moments.
   - I added a `printNarr(baseline)` to illustrate what is going on inside your 
source list.
   
   Your slightly modified test code looks like this:
   ```
    @Test
     public void kllMergeOrdering() {
       List<KllDoublesSketch> baseline = new ArrayList<>();
       final int numSk = 10; // was 90
       final int u = 100;
       final int k = 200;
       final int trials = 1000;
       for(int x = 0; x < numSk; x++) {
         KllDoublesSketch src = getUpdatableDirectDoublesSketch(k, 0, u);
         DoubleStream.iterate(0.0, i -> i + 100.0).limit(u).forEach(d -> 
src.update(d));
         baseline.add(src);
       }
   
       KllDoublesSketch tgt;  //getUpdatableDirectDoublesSketch(k, 0, u * 
numSk);
       for(int t = 0; t < trials; t ++) {
         print("TRIAL: " + t);
         tgt = null;  //.reset();
         /******* Commenting out the shuffle makes it work?? **********/
         //Collections.shuffle(baseline, rand);
         for (KllDoublesSketch src : baseline) {
           if (tgt == null) {
             tgt = src;
           } else {
             tgt.merge(src);
           }
           printNarr(baseline);
         }
       }
     }
   ```
   If you run this code as is, you will notice that it compiles and runs 
without errors.  But it is not working correctly!
   The top and bottom of the console printout looks like this:
   ```
   TRIAL: 0: [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [200, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [300, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [400, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [500, 100, 100, 100, 100, 100, 100, 100, 100, 100]
      ...
   TRIAL: 999: [899200, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899300, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899400, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899500, 100, 100, 100, 100, 100, 100, 100, 100, 100]
   : [899600, 100, 100, 100, 100, 100, 100, 100, 100, 100]
      ...
   ```
   Whoa!  The first sketch of your source list is being modified every 
iteration of the inner loop! So your source data is not what you think it is!  
The culprit is this: 
   ```
           ...
           if (tgt == null) {
             tgt = src;      // <-- this is assigning one of the sketches in 
the source list to be the target!  Oops!
           } else {
             tgt.merge(src);
           }
           ...
   ```
   The effect of that is that the first sketch in the list is also the target, 
which gets updated constantly.
   
   Now enable the shuffle by uncommenting it and look at the output:
   ```
   ...
   TRIAL: 59: [1625023502369262100, 210984021615402100, 6473720995753347100, 
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 
105492010807899100, 816447775525480900, 25943560095700]
   : [1836007523984664200, 210984021615402100, 6473720995753347100, 
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 
105492010807899100, 816447775525480900, 25943560095700]
   : [8309728519738011300, 210984021615402100, 6473720995753347100, 
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 
105492010807899100, 816447775525480900, 25943560095700]
   : [8717995943465926000, 210984021615402100, 6473720995753347100, 
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300, 
105492010807899100, 816447775525480900, 25943560095700]
   FAILED: kllMergeOrdering
   java.lang.AssertionError: The input count has exceeded the capacity of a 
long and the capability of this sketch.
   ```
   By trial 59, the all the sketches in the source list have been corrupted by 
being constantly interchanged with the actual target sketch.  Because the merge 
target sketch eventually appears in all the slots, the merge sum keep 
accumulating and essentially multiplying by 10 (the list size), which is 
effectively exponential growth of N.  Quickly it exceeds the capacity of a long 
(which is huge), and the test fails because the sketch is not designed to 
handle numbers this large.   The specific AssertionError you see here is not 
yet part of a release yet.  Nonetheless, the sketch will crash.  
   
   To fix the above code, do the following:
   - Change the line: `KllDoublesSketch tgt;  
//getUpdatableDirectDoublesSketch(k, 0, u * numSk);` 
   to: `KllDoublesSketch tgt = getUpdatableDirectDoublesSketch(k, 0, u * 
numSk);`.
   - Change the line: `tgt = null;  //.reset();` to `tgt.reset();`.
   
   
   ### Comments
   - Unfortunately, your test isn't even a test of merge order, because all of 
the sketches in your source list are created to be exactly the same!  So even 
though you are shuffling the list between trials, the data being fed to the 
sketch is not changing (assuming the test is written correctly!), so the sketch 
doesn't see anything different.
   - Especially when writing tests, it is important to follow the K.I.S.S. rule 
and make your code as simple and transparent as possible. I will try to 
illustrate this with a rewrite of your code as follows:
   
   ```
     @Test
     public void betterKllMergeOrdering() {
       final int numSk = 10;
       final int u = 100;
       final int k = 200;
       final int trials = 1000;
       final KllDoublesSketch[] srcArr = new KllDoublesSketch[numSk];
       double v = 0.0;
       for(int i = 0; i < numSk; i++) {
         KllDoublesSketch src = getUpdatableDoublesSketch(k, 0);
         for (int j = 0; j < u; j++) { src.update(v += 100.0); }
         srcArr[i] = src;
       }
   
       KllDoublesSketch tgt = getUpdatableDoublesSketch(k, 0);
       for (int t = 1; t <= trials; t ++) {
         println("TRIAL: " + t);
         tgt.reset();
         shuffle(srcArr, rand);
         for (KllDoublesSketch src : srcArr) {
           tgt.merge(src);
           printNarr(srcArr);
         }
         assert tgt.getN() == numSk * u : "Test invariant of Total N has been 
violated.";
       }
     }
   ```
   
   - Your use of Collections, Lists and Streams is overly complex and abstract 
for this task and doesn't save any lines of code.   In fact, this simpler 
version is a few lines shorter.  Using simple for-loops and primitive arrays is 
not only easier to understand and debug, but it will run faster.  Collections 
are notoriously slow and use lots of memory.  For example, go look at the 
source code behind the `Collections.shuffle(...)` It is doing lots of extra 
data movement by first converting the collection to an array, then doing the 
shuffle, then moving the array back to the collection.  
   - Your test also doesn't do any real checking (perhaps because you wanted to 
simplify your example).  Nonetheless, as a simple, but not comprehensive check, 
I added an assert statement at the end that confirms that the Total N of the 
target sketch is just the number of sketches in your source array, times the 
number of values entered into each sketch.  This is a test _invariant_, as it 
should not change even though the values entered into each sketch may vary.
   - I also substituted the direct sketch for the heap-based sketch in the 
simplified code.  They both will behave the same, but the heap-based sketch is 
easier to see what is going on from my IDE.  The choice is up to you.  
   
   I hope this helps.
   
   Cheers,
   Lee.
   
   
[JavaIssue514.java.zip](https://github.com/apache/datasketches-java/files/14440708/JavaIssue514.java.zip)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to