[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

JoshRosen Fri, 29 Aug 2014 15:01:45 -0700

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2142#issuecomment-53935085
  
    To check my understanding of the error-correction code:
    
    - Due to hash collisions, we may underestimate the true number of distinct 
items.
    - Given a set of `k` random 32-bit hashes, the exact probability of _at 
least one_ collision ([from your 
link](http://preshing.com/20110504/hash-collision-probabilities/)) is _1 - 
e^(-k(k-1)/2*2^32)_.  We can approximate _1 - e^X_ by _X_ for small _X_.    
Therefore, the approximate probability of some hash collision is 
_-k(k-1)/2*2^32_, which is roughly _k^2/2^33_.
    
    I'm a bit confused about how the current correction term works:
    
    ```python
    c = - sys.maxint * log(1 - float(c) / sys.maxint)
    ```
    
    It looks like this is correcting for _overestimates_ of the number of 
distinct elements by subtracting a term based on the collision probability.  In 
general, won't collisions cause us underestimate instead of overestimating?
    
    Maybe we should approach this by treating the true number of distinct items 
(_k_) as a random variable and figuring out the maximum likelihood estimator of 
_k_ given an observation _c_ of the result of `countApproxDistinct`.
    
    Before we consider that, though, I wonder whether we even need a correction 
term.  Doesn't the Java implementation of `countApproxDistinct` already 
introduce hashing errors that are corrected for in its implementation?  I don't 
think that the two levels of hashing will introduce more error, since I think 
the hashcode of a Java integer should just be its value.
    
    _Note_: I'm not a statistician; please correct me if I've gotten anything 
wrong.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-2871] [PySpark] add countApproxDistinct...

Reply via email to