Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/2142#issuecomment-53935085
To check my understanding of the error-correction code:
- Due to hash collisions, we may underestimate the true number of distinct
items.
- Given a set of `k` random 32-bit hashes, the exact probability of _at
least one_ collision ([from your
link](http://preshing.com/20110504/hash-collision-probabilities/)) is _1 -
e^(-k(k-1)/2*2^32)_. We can approximate _1 - e^X_ by _X_ for small _X_.
Therefore, the approximate probability of some hash collision is
_-k(k-1)/2*2^32_, which is roughly _k^2/2^33_.
I'm a bit confused about how the current correction term works:
```python
c = - sys.maxint * log(1 - float(c) / sys.maxint)
```
It looks like this is correcting for _overestimates_ of the number of
distinct elements by subtracting a term based on the collision probability. In
general, won't collisions cause us underestimate instead of overestimating?
Maybe we should approach this by treating the true number of distinct items
(_k_) as a random variable and figuring out the maximum likelihood estimator of
_k_ given an observation _c_ of the result of `countApproxDistinct`.
Before we consider that, though, I wonder whether we even need a correction
term. Doesn't the Java implementation of `countApproxDistinct` already
introduce hashing errors that are corrected for in its implementation? I don't
think that the two levels of hashing will introduce more error, since I think
the hashcode of a Java integer should just be its value.
_Note_: I'm not a statistician; please correct me if I've gotten anything
wrong.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]