Github user imatiach-msft commented on a diff in the pull request:
https://github.com/apache/spark/pull/16377#discussion_r95612086
--- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -161,6 +161,33 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     }
   }
+  test("train with empty arrays") {
+    val lp = LabeledPoint(1.0, Vectors.dense(Array.empty[Double]))
+    val data = Array.fill(5)(lp)
+    val rdd = sc.parallelize(data)
+
+    val strategy = new OldStrategy(OldAlgo.Regression, Gini, maxDepth = 2,
+      maxBins = 100)
+    intercept[IllegalArgumentException] {
+      DecisionTreeMetadata.buildMetadata(rdd, strategy)
+    }
+  }
+
+  test("train with single point") {
--- End diff --
Ideally we would have a test that sweeps over all learners and verifies that
small edge cases like this are handled consistently, e.g. that all learners
either work or fail on an empty dataframe, a dataframe with a single row, etc.
I've added tests like this before and personally found them useful; in many
cases what seemed like a small bug in an edge case on one learner turned out
to be a more serious issue.

Also, ideally, instead of using synthetically generated data we would have a
suite of datasets from the field (e.g. the UCI repository) that we could run
all learners against, with the accuracy/execution/memory characteristics
summarized in a visualization where they can be compared across learners and
over time. This is usually an easy way to catch regressions and validate
algorithmic improvements, especially if you can view the results over long
timespans (e.g. several months), see how performance changes, and track down
the individual change sets responsible.

Unfortunately we don't have either of those frameworks, and I agree this test
doesn't really belong just in this learner, so I will remove it. But I think
generic tests that validate all classifiers/regressors/etc. in a uniform way,
including this sort of test, would be useful.
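
To make the idea of an edge-case sweep concrete, here is a minimal sketch in
plain Scala with no Spark dependency. Everything in it (`Point`, `Learner`,
`MeanRegressor`, `MajorityClassifier`, `sweep`) is a hypothetical stand-in
invented for illustration, not an existing Spark API. It assumes each learner
rejects bad input with `IllegalArgumentException`. The sweep runs every
learner on every edge case and collects the distinct outcomes, so a case where
learners disagree shows up as a set with more than one element.

```scala
import scala.util.{Failure, Success, Try}

object EdgeCaseSweep {
  // Hypothetical stand-in for a labeled training example.
  final case class Point(label: Double, features: Vector[Double])

  // Hypothetical stand-in for a learner; fit returns a prediction function.
  trait Learner {
    def name: String
    def fit(data: Seq[Point]): Point => Double
  }

  // Shared input validation; require throws IllegalArgumentException.
  private def validate(data: Seq[Point]): Unit = {
    require(data.nonEmpty, "training data must not be empty")
    require(data.head.features.nonEmpty, "number of features must be > 0")
  }

  // Toy regressor: always predicts the mean training label.
  object MeanRegressor extends Learner {
    val name = "MeanRegressor"
    def fit(data: Seq[Point]): Point => Double = {
      validate(data)
      val mean = data.map(_.label).sum / data.size
      _ => mean
    }
  }

  // Toy classifier: always predicts the most frequent training label.
  object MajorityClassifier extends Learner {
    val name = "MajorityClassifier"
    def fit(data: Seq[Point]): Point => Double = {
      validate(data)
      val majority = data.groupBy(_.label).maxBy(_._2.size)._1
      _ => majority
    }
  }

  // The edge cases mentioned in the review: empty data, empty feature
  // vectors, and a single training point.
  val edgeCases: Seq[(String, Seq[Point])] = Seq(
    "empty dataset" -> Seq.empty,
    "empty feature vectors" -> Seq.fill(5)(Point(1.0, Vector.empty)),
    "single point" -> Seq(Point(1.0, Vector(0.5)))
  )

  // "ok" if training succeeds, otherwise the exception's simple class name.
  def outcome(learner: Learner, data: Seq[Point]): String =
    Try(learner.fit(data)) match {
      case Success(_) => "ok"
      case Failure(e) => e.getClass.getSimpleName
    }

  // For each edge case, the set of distinct outcomes across all learners.
  def sweep(learners: Seq[Learner]): Map[String, Set[String]] =
    edgeCases.map { case (caseName, data) =>
      caseName -> learners.map(outcome(_, data)).toSet
    }.toMap

  def main(args: Array[String]): Unit =
    sweep(Seq(MeanRegressor, MajorityClassifier)).foreach {
      case (caseName, outcomes) =>
        val verdict = if (outcomes.size == 1) "consistent" else "INCONSISTENT"
        println(s"$caseName: ${outcomes.mkString(", ")} ($verdict)")
    }
}
```

In a real suite the learner list would presumably be the actual
classifiers/regressors and the check would be a test assertion that every
outcome set has exactly one element, rather than a printed verdict.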