[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

asolimando Sun, 18 Feb 2018 10:45:16 -0800

Github user asolimando commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20632#discussion_r168961088
  
    --- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -303,26 +303,6 @@ class DecisionTreeSuite extends SparkFunSuite with 
MLlibTestSparkContext {
         assert(split.threshold < 2020)
       }
     
    -  test("Multiclass classification stump with 10-ary (ordered) categorical 
features") {
    --- End diff --
    
    I am sorry, two out of three of the adapted tests were errouneously not 
included in the PR, despite referred to in the description, my bad for not 
realising until now. 
    
    I will describe what I did for those tests that I have added to my latest 
commit:
    
    - "mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala"
      1. "Multiclass classification stump with 10-ary (ordered) categorical 
features"
      2. "Multiclass classification tree with 10-ary (ordered) categorical 
features with just enough bins"
    
    I had to "adapt" the tests as they where checking properties related to the 
"learning phase" on the DecisionTree itself, but they were checking on nodes 
that are not there anymore (on those test trees) due to the pruning.
    
    So, mimicking what is done extensively in the existing test suite, I have 
extracted the needed information using the "internal" methods of RandomForest 
(e.g., _RandomForest.findBestSplits_) and check them directly.
    
    So the tests have been moved to ".../ml/tree/impl/RandomForestSuite.scala" 
and adapted to call the internal methods of the "ml" version of 
RandomForest/DecisionTree.
    
    Anyway, internally, the _mllib_ version of _DecisionTree_ is anyway calling 
that of the _ml_ version. By looking at several other tests in 
".../ml/tree/impl/RandomForestSuite.scala" ("Avoid aggregation on the last 
level" test, for instance), I noticed that they were "a porting" as well, so I 
have followed their style.
    
    To double-check the adaptation I did a debugging pass for comparing the 
values of the relevant objects in the two cases, and they were identical. 
    
    The only removed test was "Use soft prediction for binary classification 
with ordered categorical features" 
    in 
"mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala".
    
    This test was a duplicate of the test having the same name and already 
located in ".../ml/tree/impl/RandomForestSuite.scala", 
    and given that _DecisionTree_ is only a wrapper for _RandomForest_ with 
_numTree=1_, they where doing exactly the same thing (modulo the call inside 
the wrapper), and the coverage of the test cases is left untouched by its 
removal.
    
    "Use soft prediction for binary classification with ordered categorical 
features" in ".../ml/tree/impl/RandomForestSuite.scala" had to be adapted too 
for the same reasons (it was checking information in nodes that are now pruned).
    
    PS: I didn't check extensively as it was out of scope for the PR, but I 
remember that this was not the only case, there might be are redundant tests. 
If you think it could be a good idea to remove them, I can re-check and open a 
JIRA ticket with the gathered information.




---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #20632: [SPARK-3159] added subtree pruning in the transla...

Reply via email to