Github user josepablocam commented on a diff in the pull request:
https://github.com/apache/spark/pull/6994#discussion_r34079552
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala
---
@@ -158,4 +158,25 @@ object Statistics {
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
ChiSqTest.chiSquaredFeatures(data)
}
+
+ /**
+ * Conduct a one-sample, two sided Kolmogorov Smirnov test for probability distribution equality
+ * @param data an `RDD[Double]` containing the sample of data to test
+ * @param cdf a `Double => Double` function to calculate the theoretical CDF at a given value
+ * @return KSTestResult object containing test statistic, p-value, and null hypothesis.
+ */
+ def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = {
+ KSTest.testOneSample(data, cdf)
+ }
+
+ /**
+ * Convenience function to conduct a one-sample, two sided Kolmogorov Smirnov test for probability
+ * distribution equality. Currently supports standard normal distribution only.
+ * @param data an `RDD[Double]` containing the sample of data to test
+ * @param name a `String` name for a theoretical distribution
--- End diff --
Yes on point 1, if I'm understanding correctly. My only concern at the time
was that recreating the distribution object for every observation, each time
the cdf function is called, seemed inefficient compared with creating it once
per partition and reusing it for every observation in that partition (though I
could of course be wrong on this point).
You're right that the distributions in math3 are serializable (it seems that
is another difference between 3.4.1 and 3.1.1). So perhaps a better approach is
simply to have the API take a RealDistribution directly? That would eliminate
the name system, but we would keep the Double => Double option as well, in case
users want something that is not implemented in math3.
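A rough picture of that API shape, in plain Scala rather than the PR's actual code: `Dist` is a hypothetical stand-in for math3's `RealDistribution` (which exposes `cumulativeProbability`), `Seq[Double]` stands in for `RDD[Double]`, and `ksStatistic` is an assumed name that returns only the test statistic instead of a `KSTestResult`.

```scala
object KsApiSketch {
  // Hypothetical stand-in for math3's RealDistribution.
  trait Dist { def cumulativeProbability(x: Double): Double }

  // Function-based variant: callers supply any theoretical CDF.
  // Assumes `data` is non-empty.
  def ksStatistic(data: Seq[Double], cdf: Double => Double): Double = {
    val sorted = data.sorted
    val n = sorted.length.toDouble
    // Largest gap between the theoretical CDF and the empirical CDF,
    // checked just before and at each sorted observation.
    sorted.zipWithIndex.map { case (x, i) =>
      val f = cdf(x)
      math.max(f - i / n, (i + 1) / n - f)
    }.max
  }

  // Distribution-based variant: delegates to the function-based one.
  def ksStatistic(data: Seq[Double], dist: Dist): Double = {
    val cdf: Double => Double = x => dist.cumulativeProbability(x)
    ksStatistic(data, cdf)
  }

  // Example distribution for illustration: uniform on [0, 1].
  object Uniform01 extends Dist {
    def cumulativeProbability(x: Double): Double =
      math.max(0.0, math.min(1.0, x))
  }
}
```

With this shape, a caller could pass either a distribution object, as in `KsApiSketch.ksStatistic(data, KsApiSketch.Uniform01)`, or a custom `Double => Double` function for anything the library does not provide.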