[
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley updated SPARK-10788:
--------------------------------------
Description:
Decision trees in spark.ml (RandomForest.scala) communicate twice as much data
as needed for unordered categorical features. Here's an example.
Say there are 3 categories A, B, C. We consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
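
As an illustration of where the 3 splits come from, here is a minimal Python sketch (not Spark code; `unordered_splits` is a hypothetical helper named only for this example). It pins the first category to the left-hand side so that a split and its mirror image are never both produced, which gives 2^(k-1) - 1 splits for k categories.

```python
from itertools import combinations


def unordered_splits(categories):
    """Return each distinct two-way partition once, as (left, right) tuples.

    Hypothetical helper for illustration only. The first category is always
    placed on the left, so a split and its flipped copy are not both emitted.
    """
    first, rest = categories[0], categories[1:]
    splits = []
    # r stops short of len(rest) so the right-hand side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = (first,) + extra
            right = tuple(c for c in rest if c not in extra)
            splits.append((left, right))
    return splits


splits = unordered_splits(["A", "B", "C"])
for left, right in splits:
    print(", ".join(left), "vs.", ", ".join(right))
```

For the three categories above this prints the same 3 splits listed in the description.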
Currently, we collect statistics for each of the 6 subsets of categories, one
for each side of each split (3 * 2 = 6). However, we could instead collect
statistics for only the 3 subsets on the left-hand side of the 3 possible
splits: A; A,B; and A,C. If we also have stats for the entire node, then we can
compute the stats for the 3 subsets on the right-hand side of the splits by
subtraction. In pseudomath: {{stats(B,C) = stats(A,B,C) - stats(A)}}.
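
The subtraction can be sketched in Python as follows. This is an assumption-laden illustration rather than the spark.ml implementation: the label counts are made-up data, the names `per_category`, `stats`, and `subtract` are hypothetical, and it assumes the statistics are additive (for example, per-label counts).

```python
LABELS = ("pos", "neg")

# Made-up per-category label counts at one tree node (illustrative data only).
per_category = {
    "A": {"pos": 4, "neg": 1},
    "B": {"pos": 2, "neg": 3},
    "C": {"pos": 0, "neg": 5},
}


def stats(subset):
    """Sum the label counts over the categories in `subset`."""
    return {label: sum(per_category[c][label] for c in subset) for label in LABELS}


def subtract(node, left):
    """Right-hand stats = whole-node stats minus left-hand stats."""
    return {label: node[label] - left[label] for label in LABELS}


node_stats = stats("ABC")

# Only the 3 left-hand subsets are aggregated directly.
left_stats = {left: stats(left) for left in ("A", "AB", "AC")}

# The 3 right-hand subsets are derived from the node total, not collected.
right_stats = {left: subtract(node_stats, s) for left, s in left_stats.items()}

# stats(B,C) = stats(A,B,C) - stats(A), and likewise for the other two splits.
assert right_stats["A"] == stats("BC")
assert right_stats["AB"] == stats("C")
assert right_stats["AC"] == stats("B")
```

With this scheme, 3 left-hand subsets plus one node total replace the 6 subsets collected today.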
We should eliminate these extra bins within the spark.ml implementation since
the spark.mllib implementation will be removed before long (and will instead
call into spark.ml).
was:
Decision trees in spark.ml (RandomForest.scala) effectively create a second
copy of each split. E.g., if there are 3 categories A, B, C, then we should
consider 3 splits:
* A vs. B, C
* A, B vs. C
* A, C vs. B
Currently, we also consider the 3 flipped splits:
* B,C vs. A
* C vs. A, B
* B vs. A, C
This means we communicate twice as much data as needed for these features.
We should eliminate these duplicate splits within the spark.ml implementation
since the spark.mllib implementation will be removed before long (and will
instead call into spark.ml).
> Decision Tree duplicates bins for unordered categorical features
> ----------------------------------------------------------------
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley