Joseph K. Bradley created SPARK-10524:
-----------------------------------------

             Summary: Decision tree binary classification with ordered categorical features: incorrect centroid
                 Key: SPARK-10524
                 URL: https://issues.apache.org/jira/browse/SPARK-10524
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
    Affects Versions: 1.5.0
            Reporter: Joseph K. Bradley


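In DecisionTree and RandomForest binary classification with ordered categorical 
features, we order the categories' bins by the hard prediction (the predicted 
label, 0 or 1), but we should order them by the soft prediction (the predicted 
probability of label 1). Ordering by the hard prediction leaves all categories 
with the same predicted label tied, so the resulting order can exclude the 
best split.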
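
To illustrate the difference, here is a minimal sketch in plain Python (not 
Spark code). The per-category label counts and the helper names 
hard_prediction and soft_prediction are made up for this example; they are 
not taken from DecisionTree.scala or RandomForest.scala.

```python
# Per-category label counts as (count of label 0, count of label 1).
# These numbers are invented for illustration only.
counts = {0: (6, 4), 1: (9, 1), 2: (1, 9), 3: (4, 6)}


def hard_prediction(c0, c1):
    """Predicted label for the category: 0.0 or 1.0."""
    return 1.0 if c1 > c0 else 0.0


def soft_prediction(c0, c1):
    """Fraction of the category's instances that have label 1."""
    return c1 / (c0 + c1)


# sorted() is stable, so categories with equal keys keep their input order.
hard_order = sorted(counts, key=lambda k: hard_prediction(*counts[k]))
soft_order = sorted(counts, key=lambda k: soft_prediction(*counts[k]))

print(hard_order)  # [0, 1, 2, 3]
print(soft_order)  # [1, 0, 3, 2]
```

With the hard prediction, categories 0 and 1 tie at 0.0 and categories 2 and 
3 tie at 1.0, so their relative order is arbitrary. The soft prediction 
separates all four categories.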

Here are the 2 places in mllib and ml:
* [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L887]
* [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L779]

The PR which fixes this should include a unit test which isolates this issue, 
ideally by directly calling binsToBestSplit.
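
As a rough idea of what such a test could check, the sketch below (plain 
Python, not a call to binsToBestSplit) builds one feature with 3 categories 
that all share the hard prediction 0, then compares the best split reachable 
under each ordering using Gini impurity. The counts and the helper names gini 
and best_prefix_split are assumptions made for this example, and the real 
test would need to go through Spark's own aggregation classes.

```python
def gini(c0, c1):
    """Gini impurity of a node holding c0 label-0 and c1 label-1 instances."""
    n = c0 + c1
    if n == 0:
        return 0.0
    p1 = c1 / n
    return 2.0 * p1 * (1.0 - p1)


def best_prefix_split(counts, order):
    """Try each split of `order` into a prefix (left) and the rest (right).

    Returns (weighted impurity, left categories) for the best split found.
    """
    total = sum(c0 + c1 for c0, c1 in counts.values())
    best = None
    for i in range(1, len(order)):
        left, right = order[:i], order[i:]
        l0 = sum(counts[k][0] for k in left)
        l1 = sum(counts[k][1] for k in left)
        r0 = sum(counts[k][0] for k in right)
        r1 = sum(counts[k][1] for k in right)
        weighted = ((l0 + l1) * gini(l0, l1) + (r0 + r1) * gini(r0, r1)) / total
        if best is None or weighted < best[0]:
            best = (weighted, left)
    return best


# (count of label 0, count of label 1) per category; invented for illustration.
# Every category has more label-0 than label-1 instances, so all hard
# predictions are 0 and the hard ordering is just the input order.
counts = {0: (60, 40), 1: (90, 10), 2: (55, 45)}

hard_order = sorted(counts, key=lambda k: 1.0 if counts[k][1] > counts[k][0] else 0.0)
soft_order = sorted(counts, key=lambda k: counts[k][1] / sum(counts[k]))

hard_impurity, hard_left = best_prefix_split(counts, hard_order)
soft_impurity, soft_left = best_prefix_split(counts, soft_order)

print(hard_order, hard_left, hard_impurity)
print(soft_order, soft_left, soft_impurity)
```

In this example the soft ordering can isolate category 1, the purest one, on 
its own side of the split. The hard ordering cannot reach that partition and 
settles for a split with higher weighted impurity.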



