[
https://issues.apache.org/jira/browse/MATH-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350186#comment-17350186
]
Alex Herbert commented on MATH-1536:
------------------------------------
With MATH-1533 (the sensitivity of discrete sampling an enumerated probability
distribution to the order of items in the constructor) there is a case for
solving it with javadoc (it does not matter to sampling in practice) and for
solving it by making the sampling independent of the order passed to the
constructor (it allows maximum support for small probabilities).
The constructor can perform a sort on probabilities before creating the
cumulative distribution. A sort on the probabilities would allow a better use
of the precision available from an IEEE754 double. There are 2^52 doubles
between 0.5 and 1, all with the same exponent. There are 1021 exponents between
0 and 0.5. So many more doubles can be represented that are less than 0.5. So
if the probabilities are sorted from small to large the cumulative sum (and
distribution) will be able to incorporate smaller probabilities. In a worst
case sorting from large to small may result in small probabilities never being
sampled as the cumulative sum does not include them. This is in theory. In
practice sampling is usually based on a double from the RNG which is limited to
the 2^53 dyadic rationals in the range 0 to 1. So any probabilities less than
2^-53 would be ignored by sampling and the input order of the probabilities to
the constructor is less important. Any probability above 2^52 would be able to
be incorporated in the cumulative sum, irrespective of order, and any less than
2^53 would not be sampled by a uniform double.
In commons RNG there are 3 discrete probability samplers in use. The GuideTable
sampler uses a double to do an optimised search of a cumulative distribution,
and the AliasMethod sampler constructs the alias tables assuming probabilities
are fractions with a denominator of 2^53. The other look-up table sampler is
optimised for probabilities as fractions with a denominator of 2^30; the
sampler is very fast but limited in the support and we ignore it from the
discussion. So the two candidate sampler limit sampling to a minimum
probability of at least 2^-53. So input order to the sampler constructor may
prevent representation of any probability of 2^-53 if it occurs in the
cumulative sum after 0.5. Any probability of over 2^52 will be represented
irrespective of input order.
Thus there is a case that sorting the input probabilities will ensure the
maximum support for small probability values, specifically those between 2^-53
and 2^-52. However note that the likelihood that an item with a probability of
2^-53 will be sampled in use is low enough to be ignored and the effect is
minimal.
The ticket 1533 does state that this can be solved using javadoc. I do not
think a sort should be done in the sampler. This has a performance overhead for
minimal gain in the ability to effectively sample the distribution. Only
probabilities between 2^-53 and 2^-52 are effected by a sort. I would solve
this by documenting that the cumulative probability distribution is created
(and sampled from) using the input order. A different input order will create a
different sequence of samples. The samples will be reproducible with the same
RNG starting from the same RNG state given the same input order for the
distribution constructor. It should be documented to leave the sort to the user
if they critically depend on sampling being independent of the distribution
order passed in the constructor.
> Sensitivity to RNG (unit tests)
> -------------------------------
>
> Key: MATH-1536
> URL: https://issues.apache.org/jira/browse/MATH-1536
> Project: Commons Math
> Issue Type: Task
> Reporter: Gilles Sadowski
> Priority: Major
> Labels: rng, unit-test
> Fix For: 4.0
>
>
> Several unit tests fail when upgrading to version 1.3 of "Commons RNG":
> {noformat}
> [ERROR] Failures:
> [ERROR] LogitTest.testDerivativesWithInverseFunction:195 maxOrder = 2
> expected:<0.0> but was:<1.0658141036401503E-14>
> [ERROR] EnumeratedIntegerDistributionTest.testMath1533:196
> [ERROR] EnumeratedIntegerDistributionTest.testSample:174 expected:<7.84>
> but was:<7.857073891264003>
> [ERROR] MiniBatchKMeansClustererTest.testCompareToKMeans:86 Different score
> ratio 46.645378%!, diff points ratio: 34.716981%
> [ERROR] CalinskiHarabaszTest.test_compare_to_skLearn:102
> expected:<597.7763150683217> but was:<559.2829020672648>
> [ERROR] MultiStartMultivariateOptimizerTest.testCircleFitting:76
> expected:<69.9597> but was:<69.96228624385736>
> [ERROR] MultiStartMultivariateOptimizerTest.testRosenbrock:114 numEval=873
> [ERROR] GaussianRandomGeneratorTest.testMeanAndStandardDeviation:37
> expected:<1.0> but was:<0.9715310171501561>
> [ERROR] NaturalRankingTest.testNaNsFixedTiesRandom:227 Array comparison
> failure
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)