atronchi opened a new pull request #26197: Implement p-value simulation and 
unit tests for chi2 test
URL: https://github.com/apache/spark/pull/26197
 
 
   ### What changes were proposed in this pull request?
   This PR implements Monte Carlo simulation of p-values for the ChiSqTest in 
mllib. For related implementations and background, see the following references:
   * https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/chisq.test
   * https://en.wikipedia.org/wiki/Generalized_p-value
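
   For readers unfamiliar with the technique, here is a minimal single-machine 
sketch in plain Python. It is not the code in this PR and makes no attempt at 
the scalable Spark implementation described here. The names `chi2_statistic`, 
`simulate_p_value`, and `num_draws` are hypothetical, chosen for illustration 
only; the add-one correction in the final line follows the convention used by 
R's `chisq.test` with `simulate.p.value = TRUE`.

```python
import random


def chi2_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_p_value(observed, probs, num_draws=2000, seed=0):
    """Monte Carlo p-value for a goodness-of-fit test (illustrative sketch).

    observed:  observed counts per category
    probs:     category probabilities under the null hypothesis
    num_draws: number of simulated data sets (hypothetical parameter name)
    """
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = chi2_statistic(observed, expected)
    categories = range(len(probs))

    exceed = 0
    for _ in range(num_draws):
        # Draw n samples under the null hypothesis and tabulate the counts.
        counts = [0] * len(probs)
        for c in rng.choices(categories, weights=probs, k=n):
            counts[c] += 1
        # Count simulated statistics at least as extreme as the observed one.
        if chi2_statistic(counts, expected) >= stat:
            exceed += 1

    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (num_draws + 1)
```

   Calling `simulate_p_value([70, 10, 10, 10], [0.25] * 4)` should give a very 
small p-value, since the counts are far from uniform, whereas counts that match 
the expected values exactly give a p-value of 1 by construction.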
   
   ### Why are the changes needed?
   While Monte Carlo simulation is a common approach to estimating p-values, a 
robust, scalable implementation in Spark was non-trivial, so we hope others can 
reuse this work.
   
   ### Does this PR introduce any user-facing change?
   We add a new boolean parameter `simulatePValue` to the ChiSqTest so that 
users can request p-value simulation, and an integer parameter `numDraw` so 
that users can specify the number of draws to take. The `getChi2Digest` method 
is also exposed in case users find value in the digest object itself, which 
allows extraction of arbitrary quantiles, the CDF, etc.
   
   ### How was this patch tested?
   This PR also implements the `ChiSqTestSuite`, with tests verifying that both 
the ChiSqTest itself and the new p-value simulation work correctly: test cases 
expected to pass a chi-squared test do pass, and those expected to fail do 
fail.
   
   We ran these tests with the following results: 
   ```
   $ build/mvn package -pl mllib -Dtest=none -DwildcardSuites=org.apache.spark.mllib.stat.test.ChiSqTestSuite
   ...
   ChiSqTestSuite:
   - theoretical chi2 test
   - simulated/empirical chi2 test
   Run completed in 1 minute, 22 seconds.
   Total number of tests run: 2
   Suites: completed 2, aborted 0
   Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ...
   [INFO] BUILD SUCCESS
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
