[ 
https://issues.apache.org/jira/browse/MATH-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350186#comment-17350186
 ] 

Alex Herbert commented on MATH-1536:
------------------------------------

With MATH-1533 (the sensitivity of discrete sampling an enumerated probability 
distribution to the order of items in the constructor) there is a case for 
solving it with javadoc (it does not matter to sampling in practice) and for 
solving it by making the sampling independent of the order passed to the 
constructor (it allows maximum support for small probabilities).

The constructor can perform a sort on probabilities before creating the 
cumulative distribution. A sort on the probabilities would allow a better use 
of the precision available from an IEEE754 double. There are 2^52 doubles 
between 0.5 and 1, all with the same exponent. There are 1021 exponents between 
0 and 0.5. So many more doubles can be represented that are less than 0.5. So 
if the probabilities are sorted from small to large the cumulative sum (and 
distribution) will be able to incorporate smaller probabilities. In a worst 
case sorting from large to small may result in small probabilities never being 
sampled as the cumulative sum does not include them. This is in theory. In 
practice sampling is usually based on a double from the RNG which is limited to 
the 2^53 dyadic rationals in the range 0 to 1. So any probabilities less than 
2^-53 would be ignored by sampling and the input order of the probabilities to 
the constructor is less important. Any probability above 2^52 would be able to 
be incorporated in the cumulative sum, irrespective of order, and any less than 
2^53 would not be sampled by a uniform double.

In commons RNG there are 3 discrete probability samplers in use. The GuideTable 
sampler uses a double to do an optimised search of a cumulative distribution, 
and the AliasMethod sampler constructs the alias tables assuming probabilities 
are fractions with a denominator of 2^53. The other look-up table sampler is 
optimised for probabilities as fractions with a denominator of 2^30; the 
sampler is very fast but limited in the support and we ignore it from the 
discussion. So the two candidate sampler limit sampling to a minimum 
probability of at least 2^-53. So input order to the sampler constructor may 
prevent representation of any probability of 2^-53 if it occurs in the 
cumulative sum after 0.5. Any probability of over 2^52 will be represented 
irrespective of input order.

Thus there is a case that sorting the input probabilities will ensure the 
maximum support for small probability values, specifically those between 2^-53 
and 2^-52. However note that the likelihood that an item with a probability of 
2^-53 will be sampled in use is low enough to be ignored and the effect is 
minimal.

 

The ticket 1533 does state that this can be solved using javadoc. I do not 
think a sort should be done in the sampler. This has a performance overhead for 
minimal gain in the ability to effectively sample the distribution. Only 
probabilities between 2^-53 and 2^-52 are effected by a sort. I would solve 
this by documenting that the cumulative probability distribution is created 
(and sampled from) using the input order. A different input order will create a 
different sequence of samples. The samples will be reproducible with the same 
RNG starting from the same RNG state given the same input order for the 
distribution constructor. It should be documented to leave the sort to the user 
if they critically depend on sampling being independent of the distribution 
order passed in the constructor.

 

> Sensitivity to RNG (unit tests)
> -------------------------------
>
>                 Key: MATH-1536
>                 URL: https://issues.apache.org/jira/browse/MATH-1536
>             Project: Commons Math
>          Issue Type: Task
>            Reporter: Gilles Sadowski
>            Priority: Major
>              Labels: rng, unit-test
>             Fix For: 4.0
>
>
> Several unit tests fail when upgrading to version 1.3 of "Commons RNG":
> {noformat}
> [ERROR] Failures: 
> [ERROR]   LogitTest.testDerivativesWithInverseFunction:195 maxOrder = 2 
> expected:<0.0> but was:<1.0658141036401503E-14>
> [ERROR]   EnumeratedIntegerDistributionTest.testMath1533:196
> [ERROR]   EnumeratedIntegerDistributionTest.testSample:174 expected:<7.84> 
> but was:<7.857073891264003>
> [ERROR]   MiniBatchKMeansClustererTest.testCompareToKMeans:86 Different score 
> ratio 46.645378%!, diff points ratio: 34.716981%
> [ERROR]   CalinskiHarabaszTest.test_compare_to_skLearn:102 
> expected:<597.7763150683217> but was:<559.2829020672648>
> [ERROR]   MultiStartMultivariateOptimizerTest.testCircleFitting:76 
> expected:<69.9597> but was:<69.96228624385736>
> [ERROR]   MultiStartMultivariateOptimizerTest.testRosenbrock:114 numEval=873
> [ERROR]   GaussianRandomGeneratorTest.testMeanAndStandardDeviation:37 
> expected:<1.0> but was:<0.9715310171501561>
> [ERROR]   NaturalRankingTest.testNaNsFixedTiesRandom:227 Array comparison 
> failure
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to