GitHub user lovasoa commented on a diff in the pull request:
https://github.com/apache/spark/pull/18276#discussion_r122086218
--- Diff: core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala ---
@@ -48,22 +48,11 @@ private[spark] class CountEvaluator(totalOutputs: Int, confidence: Double)
private[partial] object CountEvaluator {
def bound(confidence: Double, sum: Long, p: Double): BoundedDouble = {
- // Let the total count be N. A fraction p has been counted already, with sum 'sum',
- // as if each element from the total data set had been seen with probability p.
- val dist =
- if (sum <= 10000) {
- // The remaining count, k=N-sum, may be modeled as negative binomial (aka Pascal),
- // where there have been 'sum' successes of probability p already. (There are several
- // conventions, but this is the one followed by Commons Math3.)
- new PascalDistribution(sum.toInt, p)
- } else {
- // For large 'sum' (certainly, > Int.MaxValue!), use a Poisson approximation, which has
- // a different interpretation. "sum" elements have been observed having scanned a fraction
- // p of the data. This suggests data is counted at a rate of sum / p across the whole data
- // set. The total expected count from the rest is distributed as
- // (1-p) Poisson(sum / p) = Poisson(sum*(1-p)/p)
- new PoissonDistribution(sum * (1 - p) / p)
- }
+ // "sum" elements have been observed having scanned a fraction
+ // p of the data. This suggests data is counted at a rate of sum / p across the whole data
+ // set. The total expected count from the rest is distributed as
+ // (1-p) Poisson(sum / p) = Poisson(sum*(1-p)/p)
+ val dist = new PoissonDistribution(sum * (1 - p) / p)
--- End diff --
I know it is a little late for a review, but now that we have a single
distribution, the code would be clearer if it estimated the total count
directly with the Poisson distribution. That means removing the `1 - p`
factor here and the `sum + ` term in the final BoundedDouble.
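
To make the two formulations concrete, here is a minimal sketch that uses only the fact that a Poisson distribution has mean and variance both equal to its rate. The object and method names (`PoissonCountSketch`, `remainderForm`, `totalForm`) are hypothetical and not part of `CountEvaluator`; the sketch does not call Commons Math3 and only compares the first two moments of the estimated total.

```scala
// Hypothetical illustration only; these names do not exist in Spark.
object PoissonCountSketch {

  // Form in the diff: model only the unseen remainder as Poisson(sum*(1-p)/p),
  // then add the observed `sum` back. Returns (mean, variance) of the total.
  def remainderForm(sum: Long, p: Double): (Double, Double) = {
    val lambda = sum * (1 - p) / p // Poisson rate for the remaining count
    (sum + lambda, lambda)         // adding the constant `sum` shifts the mean only
  }

  // Form proposed in this comment: model the total count as Poisson(sum/p).
  // Returns (mean, variance) of the total.
  def totalForm(sum: Long, p: Double): (Double, Double) = {
    val lambda = sum / p           // Poisson rate for the total count
    (lambda, lambda)
  }

  def main(args: Array[String]): Unit = {
    val (m1, v1) = remainderForm(1000L, 0.25)
    val (m2, v2) = totalForm(1000L, 0.25)
    println(s"remainder form: mean=$m1 variance=$v1")
    println(s"total form:     mean=$m2 variance=$v2")
  }
}
```

Both forms give the same mean, since `sum + sum*(1-p)/p = sum/p`, but the direct form has the larger variance (`sum/p` rather than `sum*(1-p)/p`), so the resulting confidence bounds would be somewhat wider.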