Lordshinjo opened a new issue #152:
URL: https://github.com/apache/incubator-datasketches-cpp/issues/152
When called on 2 ordered sketches, it seems that `a_not_b` is severely
overestimating its result in some cases, with `a_not_b(a, b)` having an
estimate bigger than `a` if `b` is bigger than `a`.
Here is a small reproduction (on the current master):
```python
>>> import random
>>> import uuid
>>> import datasketches
>>> ids = [str(uuid.uuid4()) for _ in range(100_000)]
>>> a_ids = random.sample(ids, 10_000)
>>> b_ids = random.sample(ids, 25_000)
>>> len(set(a_ids) - set(b_ids))
7548
>>> a = datasketches.update_theta_sketch()
>>> b = datasketches.update_theta_sketch()
>>> for x in a_ids:
... a.update(x)
...
>>> for x in b_ids:
... b.update(x)
...
>>> a_compact = a.compact()
>>> b_compact = b.compact()
>>> print(a)
### Update Theta sketch summary:
lg nominal size : 12
lg current size : 13
num retained keys : 5333
resize factor : 8
sampling probability : 1
seed hash : 37836
empty? : false
ordered? : false
estimation mode? : true
theta (fraction) : 0.533299
theta (raw 64-bit) : 4918811584238243237
estimate : 10000
lower bound 95% conf : 9813.24
upper bound 95% conf : 10190.3
### End sketch summary
>>> print(b)
### Update Theta sketch summary:
lg nominal size : 12
lg current size : 13
num retained keys : 7074
resize factor : 8
sampling probability : 1
seed hash : 37836
empty? : false
ordered? : false
estimation mode? : true
theta (fraction) : 0.28443
theta (raw 64-bit) : 2623405683183212657
estimate : 24870.8
lower bound 95% conf : 24373.3
upper bound 95% conf : 25378.4
### End sketch summary
>>> for a_ in (a, a_compact):
... for b_ in (b, b_compact):
... print(datasketches.theta_a_not_b().compute(a_, b_))
...
### Compact Theta sketch summary:
num retained keys : 2178
seed hash : 37836
empty? : false
ordered? : true
estimation mode? : true
theta (fraction) : 0.28443
theta (raw 64-bit) : 2623405683183212657
estimate : 7657.41
lower bound 95% conf : 7382.58
upper bound 95% conf : 7942.37
### End sketch summary
### Compact Theta sketch summary:
num retained keys : 2178
seed hash : 37836
empty? : false
ordered? : true
estimation mode? : true
theta (fraction) : 0.28443
theta (raw 64-bit) : 2623405683183212657
estimate : 7657.41
lower bound 95% conf : 7382.58
upper bound 95% conf : 7942.37
### End sketch summary
### Compact Theta sketch summary:
num retained keys : 2178
seed hash : 37836
empty? : false
ordered? : true
estimation mode? : true
theta (fraction) : 0.28443
theta (raw 64-bit) : 2623405683183212657
estimate : 7657.41
lower bound 95% conf : 7382.58
upper bound 95% conf : 7942.37
### End sketch summary
### Compact Theta sketch summary:
num retained keys : 4635
seed hash : 37836
empty? : false
ordered? : true
estimation mode? : true
theta (fraction) : 0.28443
theta (raw 64-bit) : 2623405683183212657
estimate : 16295.7
lower bound 95% conf : 15893.5
upper bound 95% conf : 16708
### End sketch summary
```
In the 3 first cases, `a_not_b` is called with either 1 or 2 non ordered
sketches, and the result is correct.
In the last case it is called with 2 ordered sketches, and the estimate
(16295.7) is higher than the estimate of `a` (10000).
(When calling it on unordered compact sketches, or when `a` > `b`, the
results are correct)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]