[
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley updated SPARK-10788:
--------------------------------------
Priority: Minor (was: Major)
> Decision Tree duplicates bins for unordered categorical features
> ----------------------------------------------------------------
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much
> data as needed for unordered categorical features. Here's an example.
> Say there are 3 categories A, B, C. We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 *
> 2 = 6). However, we could instead collect statistics for the 3 subsets on
> the left-hand side of the 3 possible splits: A and A,B and A,C. If we also
> have stats for the entire node, then we can compute the stats for the 3
> subsets on the right-hand side of the splits. In pseudomath:
> {{stats(B,C) = stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since
> the spark.mllib implementation will be removed before long (and will instead
> call into spark.ml).
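
The subtraction idea in the description can be sketched outside of Spark. The snippet below is a minimal illustration in plain Python, not the RandomForest.scala code. It assumes that "stats" means per-label counts, and the helper names (left_hand_subsets, collect_left_stats, right_stats) are hypothetical and exist only for this example. It keeps one bin per left-hand subset plus one total for the node, and derives each right-hand side on demand.

```python
from collections import Counter
from itertools import combinations


def left_hand_subsets(categories):
    """Left-hand sides of the unordered splits.

    Every left side contains the first category, and the full set is
    skipped because it would leave the right side empty.
    """
    first, rest = categories[0], tuple(categories[1:])
    subsets = []
    for size in range(len(rest)):  # sizes 0 .. len(rest) - 1
        for extra in combinations(rest, size):
            subsets.append(frozenset((first,) + extra))
    return subsets


def collect_left_stats(rows, subsets):
    """Accumulate label counts per left-hand subset, plus node totals.

    rows is an iterable of (category, label) pairs. Label counts stand
    in for the real sufficient statistics (an assumption).
    """
    stats = {s: Counter() for s in subsets}
    total = Counter()
    for category, label in rows:
        total[label] += 1
        for s in subsets:
            if category in s:
                stats[s][label] += 1
    return stats, total


def right_stats(total, left):
    """stats(right) = stats(node) - stats(left), label by label."""
    return {label: total[label] - left[label] for label in total}


rows = [("A", 1), ("B", 0), ("C", 1), ("A", 0), ("B", 1)]
subsets = left_hand_subsets(["A", "B", "C"])
stats, total = collect_left_stats(rows, subsets)
for s in subsets:
    print(sorted(s), dict(stats[s]), right_stats(total, stats[s]))
```

With 3 categories this stores 3 bins plus the node total instead of 6 bins. Each right-hand side costs one subtraction per label when a split is evaluated.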
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)