[GitHub] spark pull request: [SPARK-10116] [core] XORShiftRandom.hashSeed i...

srowen Mon, 02 Nov 2015 01:50:58 -0800

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8314#discussion_r43611214
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
    @@ -588,7 +588,7 @@ class PairRDDFunctionsSuite extends SparkFunSuite with SharedSparkContext {
           }
           val stdev = if (withReplacement) math.sqrt(expected) else math.sqrt(expected * p * (1 - p))
           // Very forgiving margin since we're dealing with very small sample sizes most of the time
    -      math.abs(actual - expected) <= 6 * stdev
    +      math.abs(actual - expected) <= 6 * stdev + 2
    --- End diff --
    
    Really, this expression assumes that the binomial and Poisson
    distributions are well approximated by a normal distribution. When the
    expected value is only in the 10s or 20s, that approximation is probably
    poor. The check could be rewritten to compute the probability properly
    using PoissonDistribution and BinomialDistribution. However, I think it
    would be faster to just make sure that the RDD size is not less than 1000
    or so in the tests above. (Also, the places where the expected count is
    computed with math.ceil are unnecessary: there is no reason to require the
    expected count to be an integer, and the rounding is another source of
    small errors. Let expected be a Double.)
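
A minimal sketch of the exact check mentioned above, not the code in the pull request. It assumes that PoissonDistribution and BinomialDistribution refer to the Apache Commons Math 3 classes, that the sample count is modelled as Poisson(n * p) with replacement and Binomial(n, p) without, and that n * p is positive. The helper name `withinExactBounds` and the tolerance `alpha` are hypothetical.

```scala
import org.apache.commons.math3.distribution.{BinomialDistribution, IntegerDistribution, PoissonDistribution}

object SampleCountCheck {
  // Hypothetical helper: returns true when `actual` lies between the
  // alpha/2 and 1 - alpha/2 quantiles of the exact count distribution,
  // instead of relying on a normal approximation (6 * stdev).
  def withinExactBounds(
      actual: Long,
      n: Int,
      p: Double,
      withReplacement: Boolean,
      alpha: Double = 1e-6): Boolean = {
    // Assumed model: Poisson(n * p) with replacement, Binomial(n, p) without.
    val dist: IntegerDistribution =
      if (withReplacement) new PoissonDistribution(n * p)
      else new BinomialDistribution(n, p)
    // inverseCumulativeProbability(q) is the smallest x with P(X <= x) >= q.
    val lower = dist.inverseCumulativeProbability(alpha / 2)
    val upper = dist.inverseCumulativeProbability(1 - alpha / 2)
    actual >= lower && actual <= upper
  }
}
```

Because the bounds come from the distribution itself, the expected count n * p stays a Double and no math.ceil rounding is needed.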



