Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/2455#discussion_r18124373
--- Diff: core/src/main/scala/org/apache/spark/util/random/RandomSampler.scala ---
@@ -43,9 +46,34 @@ trait RandomSampler[T, U] extends Pseudorandom with Cloneable with Serializable
throw new NotImplementedError("clone() is not implemented.")
}
+private[spark]
+object RandomSampler {
+  // Default random number generator used by random samplers
+  def rngDefault: Random = new XORShiftRandom
+
+  // Default gap sampling maximum
+  // For sampling fractions <= this value, the gap sampling optimization will
+  // be applied. Above this value, it is assumed that "traditional" Bernoulli
+  // sampling is faster. The optimal value for this will depend on the RNG.
+  // More expensive RNGs will tend to make the optimal value higher. The most
+  // reliable way to determine this value for a given RNG is to experiment.
+  // I would expect a value of 0.5 to be close in most cases.
+  def gsmDefault: Double = 0.4
+
+  // Default gap sampling epsilon
+  // When sampling random floating point values, the gap sampling logic
+  // requires value > 0. An optimal value for this parameter is at or near
+  // the minimum positive floating point value returned by nextDouble() for
+  // the RNG being used.
+  def epsDefault: Double = 5e-11
--- End diff --
This is quite minor and tangential, but, is it clearer to write doubles
with a `.0`? and to omit the type of the definition?
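For context on the defaults the diff above introduces: gap sampling replaces one Bernoulli trial per element with a single geometric draw that says how many elements to skip before the next sampled one, and the epsilon keeps the uniform draw strictly positive so its logarithm stays finite. A minimal sketch of that idea, using illustrative names that are not Spark's API:

```scala
import scala.util.Random

object GapSamplingSketch {
  // Keeps u strictly positive so math.log(u) stays finite; mirrors the
  // role of epsDefault in the patch, but is an illustrative choice here.
  val eps = 5e-11

  // Number of elements to skip before the next sampled element, for
  // sampling fraction f: a draw from a geometric distribution, since
  // P(gap = k) = (1 - f)^k * f.
  def nextGap(rng: Random, f: Double): Int = {
    val u = math.max(rng.nextDouble(), eps)
    (math.log(u) / math.log(1.0 - f)).toInt
  }

  // Sample each element independently with probability f, advancing by
  // geometric gaps instead of one Bernoulli trial per element.
  def sample[T](data: Seq[T], f: Double, rng: Random): Seq[T] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[T]
    var i = nextGap(rng, f)
    while (i < data.length) {
      out += data(i)
      i += 1 + nextGap(rng, f)
    }
    out.toSeq
  }
}
```

The win is that for small fractions most elements are skipped with no RNG call at all, which is why the optimization only pays off below some threshold like `gsmDefault`.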
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]