leerho commented on issue #514:
URL:
https://github.com/apache/datasketches-java/issues/514#issuecomment-1970067490
@drewdahlke,
What you are experiencing is not a problem with the sketch. The problem is
that your test code is corrupting your source stream, which you have modeled as
an ArrayList<>.
### Analysis
Let me illustrate with some code snippets and test runs and I'll attach the
whole test class I am using at the end. I have made a few changes to your code:
- To make it more easily modifiable, like putting some of your magic numbers
at the top, which also makes the code more transparent.
- I have renamed some of your variables to make it a bit more clear what
their function is. Of these the most important is to distinguish between your
source list and the merging target sketch.
- I have reduced the number of sketches in your source list to 10 (from 90)
as it demonstrates your problem easily and makes printing more reasonable. The
shuffle is commented out initially.
- I also moved the declaration of the merge target sketch above the trials
loop. You will see why in a few moments.
- I added a `printNarr(baseline)` to illustrate what is going on inside your
source list.
Your slightly modified test code looks like this:
```
@Test
public void kllMergeOrdering() {
List<KllDoublesSketch> baseline = new ArrayList<>();
final int numSk = 10; // was 90
final int u = 100;
final int k = 200;
final int trials = 1000;
for(int x = 0; x < numSk; x++) {
KllDoublesSketch src = getUpdatableDirectDoublesSketch(k, 0, u);
DoubleStream.iterate(0.0, i -> i + 100.0).limit(u).forEach(d ->
src.update(d));
baseline.add(src);
}
KllDoublesSketch tgt; //getUpdatableDirectDoublesSketch(k, 0, u *
numSk);
for(int t = 0; t < trials; t ++) {
print("TRIAL: " + t);
tgt = null; //.reset();
/******* Commenting out the shuffle makes it work?? **********/
//Collections.shuffle(baseline, rand);
for (KllDoublesSketch src : baseline) {
if (tgt == null) {
tgt = src;
} else {
tgt.merge(src);
}
printNarr(baseline);
}
}
}
```
If you run this code as is, you will notice that it compiles and runs
without errors. But it is not working correctly!
The top and bottom of the console printout looks like this:
```
TRIAL: 0: [100, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [200, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [300, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [400, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [500, 100, 100, 100, 100, 100, 100, 100, 100, 100]
...
TRIAL: 999: [899200, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [899300, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [899400, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [899500, 100, 100, 100, 100, 100, 100, 100, 100, 100]
: [899600, 100, 100, 100, 100, 100, 100, 100, 100, 100]
...
```
Whoa! The first sketch of your source list is being modified every
iteration of the inner loop! So your source data is not what you think it is!
The culprit is this:
```
...
if (tgt == null) {
tgt = src; // <-- this is assigning one of the sketches in
the source list to be the target! Oops!
} else {
tgt.merge(src);
}
...
```
The effect of that is that the first sketch in the list is also the target,
which gets updated constantly.
Now enable the shuffle by uncommenting it and look at the output:
```
...
TRIAL: 59: [1625023502369262100, 210984021615402100, 6473720995753347100,
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300,
105492010807899100, 816447775525480900, 25943560095700]
: [1836007523984664200, 210984021615402100, 6473720995753347100,
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300,
105492010807899100, 816447775525480900, 25943560095700]
: [8309728519738011300, 210984021615402100, 6473720995753347100,
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300,
105492010807899100, 816447775525480900, 25943560095700]
: [8717995943465926000, 210984021615402100, 6473720995753347100,
408267423727914700, 3250047004738498900, 601018097404600, 4086289825189300,
105492010807899100, 816447775525480900, 25943560095700]
FAILED: kllMergeOrdering
java.lang.AssertionError: The input count has exceeded the capacity of a
long and the capability of this sketch.
```
By trial 59, the all the sketches in the source list have been corrupted by
being constantly interchanged with the actual target sketch. Because the merge
target sketch eventually appears in all the slots, the merge sum keep
accumulating and essentially multiplying by 10 (the list size), which is
effectively exponential growth of N. Quickly it exceeds the capacity of a long
(which is huge), and the test fails because the sketch is not designed to
handle numbers this large. The specific AssertionError you see here is not
yet part of a release yet. Nonetheless, the sketch will crash.
To fix the above code, do the following:
- Change the line: `KllDoublesSketch tgt;
//getUpdatableDirectDoublesSketch(k, 0, u * numSk);`
to: `KllDoublesSketch tgt = getUpdatableDirectDoublesSketch(k, 0, u *
numSk);`.
- Change the line: `tgt = null; //.reset();` to `tgt.reset();`.
### Comments
- Unfortunately, your test isn't even a test of merge order, because all of
the sketches in your source list are created to be exactly the same! So even
though you are shuffling the list between trials, the data being fed to the
sketch is not changing (assuming the test is written correctly!), so the sketch
doesn't see anything different.
- Especially when writing tests, it is important to follow the K.I.S.S. rule
and make your code as simple and transparent as possible. I will try to
illustrate this with a rewrite of your code as follows:
```
@Test
public void betterKllMergeOrdering() {
final int numSk = 10;
final int u = 100;
final int k = 200;
final int trials = 1000;
final KllDoublesSketch[] srcArr = new KllDoublesSketch[numSk];
double v = 0.0;
for(int i = 0; i < numSk; i++) {
KllDoublesSketch src = getUpdatableDoublesSketch(k, 0);
for (int j = 0; j < u; j++) { src.update(v += 100.0); }
srcArr[i] = src;
}
KllDoublesSketch tgt = getUpdatableDoublesSketch(k, 0);
for (int t = 1; t <= trials; t ++) {
println("TRIAL: " + t);
tgt.reset();
shuffle(srcArr, rand);
for (KllDoublesSketch src : srcArr) {
tgt.merge(src);
printNarr(srcArr);
}
assert tgt.getN() == numSk * u : "Test invariant of Total N has been
violated.";
}
}
```
- Your use of Collections, Lists and Streams is overly complex and abstract
for this task and doesn't save any lines of code. In fact, this simpler
version is a few lines shorter. Using simple for-loops and primitive arrays is
not only easier to understand and debug, but it will run faster. Collections
are notoriously slow and use lots of memory. For example, go look at the
source code behind the `Collections.shuffle(...)` It is doing lots of extra
data movement by first converting the collection to an array, then doing the
shuffle, then moving the array back to the collection.
- Your test also doesn't do any real checking (perhaps because you wanted to
simplify your example). Nonetheless, as a simple, but not comprehensive check,
I added an assert statement at the end that confirms that the Total N of the
target sketch is just the number of sketches in your source array, times the
number of values entered into each sketch. This is a test _invariant_, as it
should not change even though the values entered into each sketch may vary.
- I also substituted the direct sketch for the heap-based sketch in the
simplified code. They both will behave the same, but the heap-based sketch is
easier to see what is going on from my IDE. The choice is up to you.
I hope this helps.
Cheers,
Lee.
[JavaIssue514.java.zip](https://github.com/apache/datasketches-java/files/14440708/JavaIssue514.java.zip)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]