Github user asolimando commented on a diff in the pull request:
https://github.com/apache/spark/pull/20632#discussion_r168961088
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
@@ -303,26 +303,6 @@ class DecisionTreeSuite extends SparkFunSuite with
MLlibTestSparkContext {
assert(split.threshold < 2020)
}
- test("Multiclass classification stump with 10-ary (ordered) categorical
features") {
--- End diff --
I am sorry, two out of three of the adapted tests were errouneously not
included in the PR, despite referred to in the description, my bad for not
realising until now.
I will describe what I did for those tests that I have added to my latest
commit:
- "mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala"
1. "Multiclass classification stump with 10-ary (ordered) categorical
features"
2. "Multiclass classification tree with 10-ary (ordered) categorical
features with just enough bins"
I had to "adapt" the tests as they where checking properties related to the
"learning phase" on the DecisionTree itself, but they were checking on nodes
that are not there anymore (on those test trees) due to the pruning.
So, mimicking what is done extensively in the existing test suite, I have
extracted the needed information using the "internal" methods of RandomForest
(e.g., _RandomForest.findBestSplits_) and check them directly.
So the tests have been moved to ".../ml/tree/impl/RandomForestSuite.scala"
and adapted to call the internal methods of the "ml" version of
RandomForest/DecisionTree.
Anyway, internally, the _mllib_ version of _DecisionTree_ is anyway calling
that of the _ml_ version. By looking at several other tests in
".../ml/tree/impl/RandomForestSuite.scala" ("Avoid aggregation on the last
level" test, for instance), I noticed that they were "a porting" as well, so I
have followed their style.
To double-check the adaptation I did a debugging pass for comparing the
values of the relevant objects in the two cases, and they were identical.
The only removed test was "Use soft prediction for binary classification
with ordered categorical features"
in
"mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala".
This test was a duplicate of the test having the same name and already
located in ".../ml/tree/impl/RandomForestSuite.scala",
and given that _DecisionTree_ is only a wrapper for _RandomForest_ with
_numTree=1_, they where doing exactly the same thing (modulo the call inside
the wrapper), and the coverage of the test cases is left untouched by its
removal.
"Use soft prediction for binary classification with ordered categorical
features" in ".../ml/tree/impl/RandomForestSuite.scala" had to be adapted too
for the same reasons (it was checking information in nodes that are now pruned).
PS: I didn't check extensively as it was out of scope for the PR, but I
remember that this was not the only case, there might be are redundant tests.
If you think it could be a good idea to remove them, I can re-check and open a
JIRA ticket with the gathered information.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]