Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/8314#discussion_r43611214
--- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -588,7 +588,7 @@ class PairRDDFunctionsSuite extends SparkFunSuite with SharedSparkContext {
}
val stdev = if (withReplacement) math.sqrt(expected) else math.sqrt(expected * p * (1 - p))
// Very forgiving margin since we're dealing with very small sample sizes most of the time
- math.abs(actual - expected) <= 6 * stdev
+ math.abs(actual - expected) <= 6 * stdev + 2
--- End diff --
Really, this expression assumes that the binomial and Poisson distributions
are well approximated by a normal distribution. When the expected value is
only in the 10s or 20s, that approximation is poor. This could be rewritten to
compute the probability properly using PoissonDistribution and
BinomialDistribution. However, I think it would be faster to just make sure
that the RDD size is at least 1000 or so in the tests above. (Also, the parts
that compute the expected count with math.ceil are unnecessary: there is no
reason to require the count to be an integer, and the rounding is another
source of small errors. Let expected be a Double.)
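
For illustration, here is a rough sketch of what an exact check might look
like in place of the 6 * stdev margin. It is not the code in this PR. To stay
dependency-free it hand-rolls the Poisson and binomial probability mass
functions rather than calling PoissonDistribution and BinomialDistribution.
The names SampleCountCheck, isPlausible and alpha are hypothetical, and it
assumes mean > 0 and 0 < p < 1.

```scala
object SampleCountCheck {
  // log(k!) by direct summation; adequate for the small counts used in tests.
  private def logFactorial(k: Int): Double =
    (1 to k).map(i => math.log(i.toDouble)).sum

  // P(X = k) for X ~ Poisson(mean); models sampling with replacement.
  // Assumes mean > 0.
  def poissonPmf(k: Int, mean: Double): Double =
    math.exp(k * math.log(mean) - mean - logFactorial(k))

  // P(X = k) for X ~ Binomial(n, p); models sampling without replacement.
  // Assumes 0 < p < 1 and 0 <= k <= n.
  def binomialPmf(k: Int, n: Int, p: Double): Double =
    math.exp(
      logFactorial(n) - logFactorial(k) - logFactorial(n - k) +
        k * math.log(p) + (n - k) * math.log(1 - p))

  // P(X <= k) by summing the pmf; an empty range (k < 0) sums to 0.0.
  def cdf(pmf: Int => Double, k: Int): Double = (0 to k).map(pmf).sum

  // Accept the observed count unless it lies in an extreme tail:
  // both P(X <= actual) and P(X >= actual) must be at least alpha.
  // alpha is a hypothetical threshold chosen for this sketch.
  def isPlausible(actual: Int, pmf: Int => Double, alpha: Double = 1e-6): Boolean = {
    val lower = cdf(pmf, actual)           // P(X <= actual)
    val upper = 1.0 - cdf(pmf, actual - 1) // P(X >= actual)
    math.min(lower, upper) >= alpha
  }
}
```

A test could then call something like
SampleCountCheck.isPlausible(actual, SampleCountCheck.poissonPmf(_, expected))
with expected kept as a Double, so no math.ceil is needed. Raising the RDD
size in the tests is still the simpler fix.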