[
https://issues.apache.org/jira/browse/MATH-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17349501#comment-17349501
]
Alex Herbert commented on MATH-1536:
------------------------------------
In short: if the tests use random seeds then try some different seeds and/or
RNGs.
The long version is that the test assumptions may be wrong.
To check a test's robustness, run it with, say, 100 different seeds and count
how often it fails. If it fails often then there is an issue with the test.
Typically the tolerances are too tight to allow for the natural variation in the
test data. Under repeat seeding I would expect a well-constructed test to fail
at about its significance level, e.g. a p-value of 0.01 or 0.001. It can then be
left hard-coded with a passing seed to keep the test suite stable. If a test
cannot be fixed by trying a few other seeds then it requires further
investigation.
For some of the errors it looks like the assertion tolerance for equality can
simply be made a bit more lenient. Others fail by a larger amount, and some are
too cryptic to understand from the output above; I would have to look at the
tests.
Also note that in commons-rng v1.3, as well as nextDouble() gaining an extra
bit of randomness, the base implementation of nextInt(int) was changed to a
faster algorithm. So any test using nextDouble() or nextInt(int) will be
generating different data.
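To see why a change of base algorithm alone breaks seeded tests, here is a minimal sketch using JDK generators rather than commons-rng (the generator classes and seed are illustrative only): the same seed fed to two different algorithms produces unrelated streams, so any test data derived from them changes.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.SplittableRandom;

public class SeedStreams {
    public static void main(String[] args) {
        long seed = 42L;
        // Same seed, different generator algorithms: the streams are unrelated,
        // so any test data built from them will differ.
        int[] a = new Random(seed).ints(10, 0, 100).toArray();
        int[] b = new SplittableRandom(seed).ints(10, 0, 100).toArray();
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
        System.out.println("streams equal: " + Arrays.equals(a, b)); // almost surely false
    }
}
```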
Picking an easy one from above: EnumeratedIntegerDistributionTest.testSample.
This can be rewritten as:
{code:java}
@Test
public void loop_testSample() {
    int fails = 0;
    int loops = 1000;
    for (int i = 0; i < loops; i++) {
        try {
            testSample();
        } catch (AssertionError ignore) {
            fails++;
        }
    }
    // Report the count before asserting so it is visible even when the assertion fails.
    System.out.printf("Failed %d / %d%n", fails, loops);
    Assert.assertTrue(String.format("Failed %d / %d", fails, loops),
        (double) fails / loops < 0.01);
}

// Comment out @Test for repeat testing so only loop_testSample runs
//@Test
public void testSample() {
    // ... original test body unchanged ...
}{code}
{noformat}
mvn test -Dtest=EnumeratedIntegerDistributionTest#loop_testSample
{noformat}
It took a long time to run and finished with output (after several runs):
{noformat}
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 210 / 1000
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 203 / 1000
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 181 / 1000
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 211 / 1000
{noformat}
The failure rate is consistent across runs. Underlying this sampler is a
GuideTableDiscreteSampler, which is new in v1.3 and might be the culprit. I
changed the sampler to use an AliasMethodDiscreteSampler. Results:
{noformat}
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 210 / 1000
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 228 / 1000
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 198 / 1000
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 211 / 1000
{noformat}
So with a different sampling method the test still fails ~20% of the time. This
suggests the test itself is wrong. It fails on the assertion of the variance:
{code:java}
Assert.assertEquals(testDistribution.getVariance(),
sumOfSquares / n - FastMath.pow(sum / n, 2), 1e-2);
{code}
The test distribution has a variance of 7.84. The tolerance is 0.01. A
confidence interval for the variance would be:
{noformat}
(n - 1) s^2 / B  <  sigma^2  <  (n - 1) s^2 / A
Here n is the sample size and s^2 is the sample variance. The number A is the
point of the chi-square distribution with n - 1 degrees of freedom at which
exactly alpha/2 of the area under the curve is to the left of A. Similarly, B is
the point of the same chi-square distribution with exactly alpha/2 of the area
under the curve to the right of B.
{noformat}
So with n = 1000000 and s^2 set to the true variance, the confidence intervals
for p=0.01 and p=0.05 are (using MATLAB):
{noformat}
[1000000 * 7.84 / chi2inv(0.995,999999), 1000000 * 7.84 / chi2inv(0.005,999999)]
= [7.8115, 7.8686]
[1000000 * 7.84 / chi2inv(0.975,999999), 1000000 * 7.84 / chi2inv(0.025,999999)]
= [7.8183, 7.8618]
{noformat}
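The same intervals can be reproduced without MATLAB. This is a rough sketch in plain Java: for very large degrees of freedom the chi-square quantile is close to k + z * sqrt(2k) (a normal approximation; the hard-coded z values are the standard normal quantiles, and the results agree with chi2inv only to a few decimal places).

```java
public class VarianceCI {
    /** Normal approximation to the chi-square quantile for large df:
     *  chi2inv(p, k) ~ k + z_p * sqrt(2k), where z_p is the normal quantile at p. */
    static double chi2invApprox(double z, double k) {
        return k + z * Math.sqrt(2 * k);
    }

    public static void main(String[] args) {
        double n = 1_000_000;
        double s2 = 7.84;       // true variance of the test distribution
        double k = n - 1;       // degrees of freedom
        double z995 = 2.575829; // standard normal quantile at 0.995 (p=0.01 interval)
        double z975 = 1.959964; // standard normal quantile at 0.975 (p=0.05 interval)
        System.out.printf("p=0.01: [%.4f, %.4f]%n",
            n * s2 / chi2invApprox(z995, k), n * s2 / chi2invApprox(-z995, k));
        System.out.printf("p=0.05: [%.4f, %.4f]%n",
            n * s2 / chi2invApprox(z975, k), n * s2 / chi2invApprox(-z975, k));
    }
}
```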
So a correct delta for the tolerance would be 7.8686 - 7.84 = 0.0286 (p=0.01).
If I put this into the test then the failure rate drops, as expected:
{noformat}
Failed 0 / 1000
Failed 0 / 1000
Failed 3 / 1000
Failed 1 / 1000
{noformat}
But we expect failures to be 10 / 1000. So perhaps my statistics are not
correct. I changed the delta to 7.8618 - 7.84 = 0.0218 (p=0.05):
{noformat}
EnumeratedIntegerDistributionTest.loop_testSample:165 Failed 11 / 1000
Failed 6 / 1000
Failed 3 / 1000
Failed 7 / 1000
{noformat}
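As a sanity check on how surprising these counts are (my addition, not part of the original analysis): a Poisson tail with lambda set to the expected failure count shows that observing at most 3 failures when 10 are expected is itself roughly a 1% event, and 11 or fewer when 50 are expected is essentially impossible, so the true failure rate really is below the nominal level.

```java
public class FailureTail {
    /** Poisson CDF P(X <= k) for mean lambda, summed term by term. */
    static double poissonCdf(int k, double lambda) {
        double term = Math.exp(-lambda); // P(X = 0)
        double sum = term;
        for (int i = 1; i <= k; i++) {
            term *= lambda / i;          // P(X = i) from P(X = i - 1)
            sum += term;
        }
        return sum;
    }

    public static void main(String[] args) {
        // p=0.01 per run over 1000 runs: expect 10 failures, observed at most 3.
        System.out.printf("P(X <= 3 | lambda=10) = %.4f%n", poissonCdf(3, 10));
        // p=0.05 per run over 1000 runs: expect 50 failures, observed at most 11.
        System.out.printf("P(X <= 11 | lambda=50) = %.2e%n", poissonCdf(11, 50));
    }
}
```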
Here we expect around 50 / 1000 failures, but again the observed failures are
lower than expected. This may be due to the nature of the data being sampled:
the distribution has only 4 values (3 of them distinct), 3, -1, 3, 7, with
probabilities 0.2, 0.2, 0.3, 0.3, and that may be too few values to properly
exercise the distribution. What is clear is that in this test the existing
tolerance of 0.01 for the variance is too tight. A value based on the expected
confidence interval of the variance is more robust.
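For reference, the distribution itself is easy to check directly. A sketch in plain Java (using inverse-CDF sampling on a JDK Random instead of the commons-rng sampler; the seed is arbitrary): the values 3, -1, 3, 7 with probabilities 0.2, 0.2, 0.3, 0.3 collapse to P(-1)=0.2, P(3)=0.5, P(7)=0.3, which has mean 3.4 and variance 19.4 - 3.4^2 = 7.84, and the sample variance of a million draws should land inside the CI-based tolerance about 99% of the time.

```java
import java.util.Random;

public class VarianceTolerance {
    /** Sample variance of n inverse-CDF draws from P(-1)=0.2, P(3)=0.5, P(7)=0.3. */
    static double sampleVariance(long seed, int n) {
        Random rng = new Random(seed);
        double sum = 0;
        double sumOfSquares = 0;
        for (int i = 0; i < n; i++) {
            double u = rng.nextDouble();
            int x = u < 0.2 ? -1 : (u < 0.7 ? 3 : 7);
            sum += x;
            sumOfSquares += (double) x * x;
        }
        return sumOfSquares / n - Math.pow(sum / n, 2);
    }

    public static void main(String[] args) {
        double variance = sampleVariance(12345L, 1_000_000);
        System.out.printf("sample variance = %.4f (true variance 7.84)%n", variance);
        // 0.0286 is the p=0.01 half-width from the chi-square interval;
        // the original 1e-2 tolerance is well inside normal sampling variation.
        System.out.println("within CI tolerance: " + (Math.abs(variance - 7.84) <= 0.0286));
    }
}
```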
The other tests that are failing would need investigation using a similar
method of repeats to highlight the robustness of the test. I may look at those
when I have a bit more time.
> Sensitivity to RNG (unit tests)
> -------------------------------
>
> Key: MATH-1536
> URL: https://issues.apache.org/jira/browse/MATH-1536
> Project: Commons Math
> Issue Type: Task
> Reporter: Gilles Sadowski
> Priority: Major
> Labels: rng, unit-test
> Fix For: 4.0
>
>
> Several unit tests fail when upgrading to version 1.3 of "Commons RNG":
> {noformat}
> [ERROR] Failures:
> [ERROR] LogitTest.testDerivativesWithInverseFunction:195 maxOrder = 2
> expected:<0.0> but was:<1.0658141036401503E-14>
> [ERROR] EnumeratedIntegerDistributionTest.testMath1533:196
> [ERROR] EnumeratedIntegerDistributionTest.testSample:174 expected:<7.84>
> but was:<7.857073891264003>
> [ERROR] MiniBatchKMeansClustererTest.testCompareToKMeans:86 Different score
> ratio 46.645378%!, diff points ratio: 34.716981%
> [ERROR] CalinskiHarabaszTest.test_compare_to_skLearn:102
> expected:<597.7763150683217> but was:<559.2829020672648>
> [ERROR] MultiStartMultivariateOptimizerTest.testCircleFitting:76
> expected:<69.9597> but was:<69.96228624385736>
> [ERROR] MultiStartMultivariateOptimizerTest.testRosenbrock:114 numEval=873
> [ERROR] GaussianRandomGeneratorTest.testMeanAndStandardDeviation:37
> expected:<1.0> but was:<0.9715310171501561>
> [ERROR] NaturalRankingTest.testNaNsFixedTiesRandom:227 Array comparison
> failure
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)