[ https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-5688.
------------------------------
Resolution: Not a Problem
Closing this per Joseph's comments.
> Splits for Categorical Variables in DecisionTrees
> -------------------------------------------------
>
> Key: SPARK-5688
> URL: https://issues.apache.org/jira/browse/SPARK-5688
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.2.0
> Environment: Any
> Reporter: Eric Denovitzer
> Priority: Minor
> Labels: categorical, decisiontree
>
> The categories in each subset used to build a split on a categorical
> variable are not chosen at random. Each subset is derived from the binary
> representation of a number from 1 to (2^(number of categories)) - 2 (which
> excludes the empty and full subsets). In the current implementation, the
> integers used for the subsets are 1..numSplits. They should be drawn at
> random instead, to avoid biasing towards the categories with the lower indexes.
> Another problem is that if numBins/2 is bigger than the number of possible
> subsets for a given category set, numSplits is still taken to be numBins/2.
> It should be the min of numBins/2 and (2^(number of categories)) - 2 (otherwise
> the same subsets might be considered more than once when choosing the splits).
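
For readers following the description above, here is a minimal Python sketch of the subset encoding the reporter describes. It is not taken from the MLlib source, and the issue was resolved as Not a Problem, so treat it only as an illustration of the report. All function names are hypothetical, and the behaviour it models (integers 1..numSplits decoded bit by bit into category subsets) is assumed from the report.

```python
import random


def subset_from_index(index, num_categories):
    """Decode an integer into the list of categories whose bit is set."""
    return [c for c in range(num_categories) if (index >> c) & 1]


def num_candidate_subsets(num_categories):
    """Count all subsets except the empty one and the full one."""
    return 2 ** num_categories - 2


def capped_num_splits(num_bins, num_categories):
    """The cap proposed in the report: min(numBins/2, 2^k - 2)."""
    return min(num_bins // 2, num_candidate_subsets(num_categories))


def sequential_subsets(num_splits, num_categories):
    """Subsets built from the integers 1..numSplits, as the report describes."""
    return [subset_from_index(i, num_categories)
            for i in range(1, num_splits + 1)]


def random_subsets(num_splits, num_categories, seed=0):
    """Subsets built from distinct integers sampled from 1..(2^k - 2)."""
    rng = random.Random(seed)
    indices = rng.sample(range(1, 2 ** num_categories - 1), num_splits)
    return [subset_from_index(i, num_categories) for i in indices]


if __name__ == "__main__":
    k, bins = 10, 32
    n = capped_num_splits(bins, k)
    used = {c for s in sequential_subsets(n, k) for c in s}
    print("sequential subsets only touch categories:", sorted(used))
    print("random subsets:", random_subsets(n, k))
```

Under these assumptions, with 10 categories and numBins = 32 the integers 1..16 only ever set bits 0 to 4, so categories 5 to 9 never appear in a sequential subset. That is the bias towards lower indexes the report refers to. With 3 categories the cap reduces numSplits from 16 to 6, so no subset would be visited twice.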
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)