Joseph K. Bradley created SPARK-3207:
----------------------------------------

             Summary: Choose splits for continuous features in DecisionTree more adaptively
                 Key: SPARK-3207
                 URL: https://issues.apache.org/jira/browse/SPARK-3207
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Joseph K. Bradley
            Priority: Minor


DecisionTree chooses the candidate splits for each continuous feature by 
selecting an array of threshold values from a subsample of the data.

Currently, it does not check for identical values in the subsample, so it can 
end up with multiple copies of the same split.  This is not an error, but the 
duplicates are redundant, and split selection could adapt better to the data.
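
To illustrate the issue, here is a minimal Python sketch. It is not MLlib's 
actual code: the function name naive_splits and the rule of picking thresholds 
at evenly spaced positions in the sorted subsample are assumptions made only 
for this example.

```python
def naive_splits(sample, num_splits):
    """Pick num_splits thresholds at evenly spaced positions of the sorted sample.

    Hypothetical illustration only; it performs no check for repeated values.
    """
    values = sorted(sample)
    stride = len(values) / (num_splits + 1)
    return [values[int((i + 1) * stride)] for i in range(num_splits)]


# A feature where most sampled rows share the same value.
sample = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
splits = naive_splits(sample, 3)
print(splits)  # [0.0, 0.0, 1.0]: the threshold 0.0 appears twice
```

With a heavily repeated value in the subsample, two of the three requested 
splits coincide, so only two distinct splits are actually available.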

Proposal: In findSplitsBins, check for identical values and search the 
subsample for a set of unique splits.  If there are not enough unique 
candidates, reduce the number of splits for that feature.
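
One possible shape for this proposal, sketched in Python. This is an 
illustration under stated assumptions, not the findSplitsBins implementation: 
the name unique_splits, the forward search over the sorted subsample, and the 
choice to fall back to all distinct values are all hypothetical.

```python
def unique_splits(sample, max_splits):
    """Return at most max_splits distinct thresholds from the sample.

    Hypothetical sketch: fewer thresholds are returned when the sample
    does not contain enough distinct values.
    """
    values = sorted(sample)
    distinct = sorted(set(values))

    # Not enough unique candidates: reduce the number of splits.
    if len(distinct) <= max_splits:
        return distinct

    stride = len(values) / (max_splits + 1)
    splits = []
    for i in range(max_splits):
        idx = int((i + 1) * stride)
        # Search forward past values that would repeat the previous split.
        while idx < len(values) and splits and values[idx] <= splits[-1]:
            idx += 1
        if idx == len(values):
            break  # ran out of new values; return what we have
        splits.append(values[idx])
    return splits


sample = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
print(unique_splits(sample, 3))       # [0.0, 1.0, 2.0]
print(unique_splits(sample, 2))       # [0.0, 1.0]
print(unique_splits([5.0] * 4, 3))    # [5.0]: only one unique candidate
```

Because the returned list can be shorter than max_splits, any code that 
assumes a fixed number of splits or bins per feature would need to read the 
actual length instead.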

This would require modifying findSplitsBins and making sure that the number of 
splits/bins (chosen adaptively) is set correctly elsewhere in the code (such as 
in DecisionTreeMetadata).


