Github user sethah commented on a diff in the pull request:
https://github.com/apache/spark/pull/20632#discussion_r169833178
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -303,26 +303,6 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
assert(split.threshold < 2020)
}
- test("Multiclass classification stump with 10-ary (ordered) categorical
features") {
--- End diff ---
Regarding this test - it now fails for a silly reason. Because of the data, the
tree that gets built ends up with a right node containing an equal number of 1.0
and 2.0 labels. It breaks the tie by predicting 1.0, which the left node also
predicts. You can modify the data-generating method to:
```scala
def generateCategoricalDataPointsForMulticlassForOrderedFeatures(): Array[LabeledPoint] = {
  val arr = new Array[LabeledPoint](3000)
  for (i <- 0 until 3000) {
    if (i < 1001) {
      // 1001 points: label 2.0 with features (2.0, 2.0)
      arr(i) = new LabeledPoint(2.0, Vectors.dense(2.0, 2.0))
    } else if (i < 2000) {
      // 999 points: label 1.0 with features (1.0, 2.0)
      arr(i) = new LabeledPoint(1.0, Vectors.dense(1.0, 2.0))
    } else {
      // 1000 points: label 1.0 with features (2.0, 2.0), so the node holding
      // feature value 2.0 now has a 1001-to-1000 majority for label 2.0
      arr(i) = new LabeledPoint(1.0, Vectors.dense(2.0, 2.0))
    }
  }
  arr
}
```
so that 2.0 will be predicted for that node. I slightly prefer this, assuming all
the other tests pass (I checked some of the suites). The less we move around
things that are mostly unrelated to this change, the better.
---