atronchi opened a new pull request #26197: Implement p-value simulation and unit tests for chi2 test
URL: https://github.com/apache/spark/pull/26197

### What changes were proposed in this pull request?
This PR implements Monte Carlo simulation of p-values for the `ChiSqTest` in mllib. For other implementations, see the following references:
* https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/chisq.test
* https://en.wikipedia.org/wiki/Generalized_p-value

### Why are the changes needed?
Monte Carlo simulation is a common approach to estimating p-values, but a robust, scalable implementation in Spark was non-trivial, so we hope others can reuse this work.

### Does this PR introduce any user-facing change?
We add a new boolean parameter `simulatePValue` to `ChiSqTest` so that users can request p-value simulation, and an integer parameter `numDraw` so that users can specify the number of draws to take. The `getChi2Digest` method is also exposed in case users find value in the digest object itself, which allows extraction of arbitrary quantiles, the CDF, etc.

### How was this patch tested?
This PR also adds the `ChiSqTestSuite`, with tests verifying that both the `ChiSqTest` itself and the new p-value simulation work correctly: test cases expected to pass a chi-squared test do pass, and test cases expected to fail it do fail. We ran these tests with the following results:
```
$ build/mvn package -pl mllib -Dtest=none -DwildcardSuites=org.apache.spark.mllib.stat.test.ChiSqTestSuite
...
ChiSqTestSuite:
- theoretical chi2 test
- simulated/empirical chi2 test
Run completed in 1 minute, 22 seconds.
Total number of tests run: 2
Suites: completed 2, aborted 0
Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
...
[INFO] BUILD SUCCESS
```
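
For readers unfamiliar with the technique, here is a minimal single-machine sketch of Monte Carlo p-value estimation for a chi-squared goodness-of-fit test. It is not the code in this PR: it is plain Python rather than Spark, it does not use a digest, and the function names and the `num_draw` and `seed` arguments are hypothetical (chosen only to echo `numDraw`). The p-value formula `(1 + exceed) / (num_draw + 1)` follows the convention of R's `chisq.test` with `simulate.p.value = TRUE`; whether this PR uses the same convention is an assumption.

```python
import random


def chi2_statistic(observed, expected):
    """Pearson chi-squared statistic: sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simulate_p_value(observed, probs, num_draw=2000, seed=0):
    """Estimate the p-value of a goodness-of-fit test by simulation.

    observed : observed counts per category
    probs    : null-hypothesis probability of each category (sums to 1)
    num_draw : number of simulated samples (hypothetical name)
    seed     : RNG seed, so that results are reproducible
    """
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    stat = chi2_statistic(observed, expected)

    categories = range(len(probs))
    exceed = 0
    for _ in range(num_draw):
        # Draw one multinomial sample of size n under the null hypothesis.
        counts = [0] * len(probs)
        for k in rng.choices(categories, weights=probs, k=n):
            counts[k] += 1
        # Count simulated statistics at least as extreme as the observed one.
        if chi2_statistic(counts, expected) >= stat:
            exceed += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    return stat, (1 + exceed) / (num_draw + 1)


# Counts that match the null exactly give a statistic of 0 and a p-value of 1.
print(simulate_p_value([25, 25, 25, 25], [0.25] * 4))
# Strongly skewed counts give a large statistic and a small p-value.
print(simulate_p_value([70, 10, 10, 10], [0.25] * 4))
```

The cost grows with `num_draw` times the sample size, which is one reason a distributed implementation is attractive for large inputs.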
