Joseph K. Bradley created SPARK-10524:
-----------------------------------------
Summary: Decision tree binary classification with ordered categorical features: incorrect centroid
Key: SPARK-10524
URL: https://issues.apache.org/jira/browse/SPARK-10524
Project: Spark
Issue Type: Bug
Components: ML, MLlib
Affects Versions: 1.5.0
Reporter: Joseph K. Bradley
In DecisionTree and RandomForest binary classification with ordered categorical
features, we order the categories' bins by a centroid computed from the hard
prediction (the predicted class label), but we should compute it from the soft
prediction (the predicted class probability).
Here are the 2 places in mllib and ml:
* [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L887]
* [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L779]
The PR which fixes this should include a unit test which isolates this issue,
ideally by directly calling binsToBestSplit.
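To illustrate why the ordering matters, here is a minimal sketch in plain
Python rather than Spark code. The label counts, the helper names
(hard_centroid, soft_centroid, order_categories, best_ordered_split) and the
Gini-based scan over ordered prefixes are all assumptions made for
illustration; this is not the binsToBestSplit implementation. It shows one
small case where ordering by the hard prediction hides a better split that
ordering by the soft prediction finds.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-category label counts at one node:
# category index -> (count of class 0, count of class 1)
Counts = Dict[int, Tuple[int, int]]


def hard_centroid(n0: int, n1: int) -> float:
    """Hard prediction: the majority label (ties go to class 0 in this sketch)."""
    return 1.0 if n1 > n0 else 0.0


def soft_centroid(n0: int, n1: int) -> float:
    """Soft prediction: the fraction of class-1 labels in the category."""
    n = n0 + n1
    # Empty categories sort last; a choice made for this sketch.
    return n1 / n if n > 0 else float("inf")


def order_categories(counts: Counts, centroid: Callable[[int, int], float]) -> List[int]:
    """Order categories by centroid; sorted() is stable, so ties keep index order."""
    return sorted(counts, key=lambda c: centroid(*counts[c]))


def gini(n0: int, n1: int) -> float:
    """Gini impurity for binary label counts."""
    n = n0 + n1
    if n == 0:
        return 0.0
    p = n1 / n
    return 2.0 * p * (1.0 - p)


def best_ordered_split(
    counts: Counts, centroid: Callable[[int, int], float]
) -> Tuple[float, List[int]]:
    """Scan prefixes of the ordering; return (weighted Gini, left-side categories)."""
    order = order_categories(counts, centroid)
    total = sum(n0 + n1 for n0, n1 in counts.values())
    best = None
    for i in range(1, len(order)):
        left, right = order[:i], order[i:]
        l0 = sum(counts[c][0] for c in left)
        l1 = sum(counts[c][1] for c in left)
        r0 = sum(counts[c][0] for c in right)
        r1 = sum(counts[c][1] for c in right)
        score = ((l0 + l1) * gini(l0, l1) + (r0 + r1) * gini(r0, r1)) / total
        if best is None or score < best[0]:
            best = (score, left)
    return best


# Categories 0 and 1 both have hard prediction 0, but very different class-1 rates.
counts: Counts = {0: (11, 9), 1: (20, 0), 2: (9, 11), 3: (8, 12)}

print(order_categories(counts, hard_centroid))  # [0, 1, 2, 3]
print(order_categories(counts, soft_centroid))  # [1, 0, 2, 3]
print(best_ordered_split(counts, hard_centroid))
print(best_ordered_split(counts, soft_centroid))
```

With the hard prediction, categories 0 and 1 share centroid 0.0, so the pure
category 1 cannot be separated from the rest by any prefix of the ordering.
With the soft prediction, category 1 sorts first and the split {1} versus
{0, 2, 3} becomes reachable, giving a lower weighted impurity in this example.
A unit test for the fix might be built around a case of this shape.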