Lordshinjo opened a new issue #152:
URL: https://github.com/apache/incubator-datasketches-cpp/issues/152


   When called on 2 ordered sketches, it seems that `a_not_b` is severely 
overestimating its result in some cases, with `a_not_b(a, b)` having an 
estimate bigger than `a` if `b` is bigger than `a`.
   
   Here is a small reproduction (on the current master):
   ```python
   >>> import random
   >>> import uuid
   >>> import datasketches
   >>> ids = [str(uuid.uuid4()) for _ in range(100_000)]
   >>> a_ids = random.sample(ids, 10_000)
   >>> b_ids = random.sample(ids, 25_000)
   >>> len(set(a_ids) - set(b_ids))
   7548
   >>> a = datasketches.update_theta_sketch()
   >>> b = datasketches.update_theta_sketch()
   >>> for x in a_ids:
   ...     a.update(x)
   ...
   >>> for x in b_ids:
   ...     b.update(x)
   ...
   >>> a_compact = a.compact()
   >>> b_compact = b.compact()
   >>> print(a)
   ### Update Theta sketch summary:
      lg nominal size      : 12
      lg current size      : 13
      num retained keys    : 5333
      resize factor        : 8
      sampling probability : 1
      seed hash            : 37836
      empty?               : false
      ordered?             : false
      estimation mode?     : true
      theta (fraction)     : 0.533299
      theta (raw 64-bit)   : 4918811584238243237
      estimate             : 10000
      lower bound 95% conf : 9813.24
      upper bound 95% conf : 10190.3
   ### End sketch summary
   
   >>> print(b)
   ### Update Theta sketch summary:
      lg nominal size      : 12
      lg current size      : 13
      num retained keys    : 7074
      resize factor        : 8
      sampling probability : 1
      seed hash            : 37836
      empty?               : false
      ordered?             : false
      estimation mode?     : true
      theta (fraction)     : 0.28443
      theta (raw 64-bit)   : 2623405683183212657
      estimate             : 24870.8
      lower bound 95% conf : 24373.3
      upper bound 95% conf : 25378.4
   ### End sketch summary
   
   >>> for a_ in (a, a_compact):
   ...     for b_ in (b, b_compact):
   ...             print(datasketches.theta_a_not_b().compute(a_, b_))
   ...
   ### Compact Theta sketch summary:
      num retained keys    : 2178
      seed hash            : 37836
      empty?               : false
      ordered?             : true
      estimation mode?     : true
      theta (fraction)     : 0.28443
      theta (raw 64-bit)   : 2623405683183212657
      estimate             : 7657.41
      lower bound 95% conf : 7382.58
      upper bound 95% conf : 7942.37
   ### End sketch summary
   
   ### Compact Theta sketch summary:
      num retained keys    : 2178
      seed hash            : 37836
      empty?               : false
      ordered?             : true
      estimation mode?     : true
      theta (fraction)     : 0.28443
      theta (raw 64-bit)   : 2623405683183212657
      estimate             : 7657.41
      lower bound 95% conf : 7382.58
      upper bound 95% conf : 7942.37
   ### End sketch summary
   
   ### Compact Theta sketch summary:
      num retained keys    : 2178
      seed hash            : 37836
      empty?               : false
      ordered?             : true
      estimation mode?     : true
      theta (fraction)     : 0.28443
      theta (raw 64-bit)   : 2623405683183212657
      estimate             : 7657.41
      lower bound 95% conf : 7382.58
      upper bound 95% conf : 7942.37
   ### End sketch summary
   
   ### Compact Theta sketch summary:
      num retained keys    : 4635
      seed hash            : 37836
      empty?               : false
      ordered?             : true
      estimation mode?     : true
      theta (fraction)     : 0.28443
      theta (raw 64-bit)   : 2623405683183212657
      estimate             : 16295.7
      lower bound 95% conf : 15893.5
      upper bound 95% conf : 16708
   ### End sketch summary
   ```
   
   In the 3 first cases, `a_not_b` is called with either 1 or 2 non ordered 
sketches, and the result is correct.
   In the last case it is called with 2 ordered sketches, and the estimate 
(16295.7) is higher than the estimate of `a` (10000).
   
   (When calling it on unordered compact sketches, or when `a` > `b`, the 
results are correct)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to