[jira] [Resolved] (SPARK-5688) Splits for Categorical Variables in DecisionTrees

Sean Owen (JIRA) Tue, 17 Feb 2015 07:18:07 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sean Owen resolved SPARK-5688.
------------------------------
    Resolution: Not a Problem

Closing this per Joseph's comments.

> Splits for Categorical Variables in DecisionTrees
> -------------------------------------------------
>
>                 Key: SPARK-5688
>                 URL: https://issues.apache.org/jira/browse/SPARK-5688
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Any
>            Reporter: Eric Denovitzer
>            Priority: Minor
>              Labels: categorical, decisiontree
>
> The categories in each subset chosen to build a split on a categorical 
> variable are not selected at random. Each subset is derived from the binary 
> representation of an integer from 1 to (2^(number of categories)) - 2 (which 
> excludes the empty and full subsets). In the current implementation, the 
> integers used for the subsets are 1..numSplits. They should be drawn at 
> random instead, to avoid biasing towards the categories with the lower 
> indexes. 
> Another problem is that if numBins/2 is bigger than the number of possible 
> subsets for a given category set, numSplits is still taken to be numBins/2. 
> It should be the min of numBins/2 and (2^(number of categories)) - 2 
> (otherwise the same subsets might be considered more than once when choosing 
> the splits).
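
To make the report above concrete, here is a minimal Python sketch of the subset encoding it describes. This is not MLlib code: every function name below is hypothetical, and the bitmask decoding is only an assumption about how an integer maps to a category subset. Note that the issue was resolved as Not a Problem, so the sketch illustrates the reporter's description only, not a confirmed defect.

```python
import random


def subset_from_index(index, num_categories):
    """Decode an integer into a category subset: category c is included
    when bit c of `index` is set (assumed encoding, for illustration)."""
    return [c for c in range(num_categories) if (index >> c) & 1]


def max_nontrivial_subsets(num_categories):
    """Number of subsets excluding the empty and the full one."""
    return 2 ** num_categories - 2


def capped_num_splits(num_bins, num_categories):
    """The cap proposed in the report: min(numBins/2, 2^k - 2)."""
    return min(num_bins // 2, max_nontrivial_subsets(num_categories))


def sequential_subsets(num_splits, num_categories):
    """Subsets taken from the integers 1..numSplits, as the report
    says the implementation does."""
    return [subset_from_index(i, num_categories)
            for i in range(1, num_splits + 1)]


def random_subsets(num_splits, num_categories, seed=0):
    """Subsets drawn from distinct random integers in 1..(2^k - 2),
    as the report suggests. `seed` is a hypothetical parameter."""
    rng = random.Random(seed)
    indices = rng.sample(range(1, 2 ** num_categories - 1), num_splits)
    return [subset_from_index(i, num_categories) for i in indices]


# With 4 categories and 3 splits, the sequential choice only ever
# touches categories 0 and 1.
print(sequential_subsets(3, 4))

# With 3 categories there are only 6 non-trivial subsets, so the cap
# brings numSplits down from 16 to 6 when numBins is 32.
print(capped_num_splits(32, 3))
```

Under this assumed encoding, an index past the cap repeats an earlier subset: with 3 categories, index 9 decodes to the same subset as index 1, which is the duplication the report warns about.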



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

