Joseph K. Bradley created SPARK-3207:
----------------------------------------
Summary: Choose splits for continuous features in DecisionTree more adaptively
Key: SPARK-3207
URL: https://issues.apache.org/jira/browse/SPARK-3207
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor
DecisionTree splits on continuous features by choosing an array of values from
a subsample of the data.
Currently, it does not check for identical values in the subsample, so it could
end up having multiple copies of the same split. This is not an error, but it
could be improved to be more adaptive to the data.
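
To make the duplicate-split behaviour concrete, here is a minimal sketch in plain Python. It is not MLlib code. The helper name naive_splits and the rule of picking thresholds at evenly spaced ranks of the sorted subsample are assumptions made for illustration; the issue only says that an array of values is chosen from a subsample.

```python
def naive_splits(sample, num_splits):
    """Pick num_splits thresholds at evenly spaced ranks of the sorted sample.

    Hypothetical helper: an assumed stand-in for split selection that does
    not check for identical values. It is not the MLlib implementation.
    """
    ordered = sorted(sample)
    if not ordered or num_splits <= 0:
        return []
    # Spacing between the chosen ranks; may be fractional.
    stride = len(ordered) / (num_splits + 1)
    return [ordered[int((i + 1) * stride)] for i in range(num_splits)]


# A subsample dominated by a single value.
sample = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
splits = naive_splits(sample, 3)
print(splits)  # [0.0, 0.0, 1.0]: the threshold 0.0 appears twice
```

With a skewed subsample like this one, several of the evenly spaced ranks land on the same value, so two of the three requested splits are identical.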
Proposal: In findSplitsBins, check for identical values and search for a set of
unique splits. If there are not enough unique candidates, reduce the number of
splits.
This would require modifying findSplitsBins and making sure that the number of
splits/bins (chosen adaptively) is set correctly elsewhere in the code (such as
in DecisionTreeMetadata).
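
One possible shape for the proposal, sketched in plain Python under stated assumptions: the name adaptive_splits, the use of midpoints between neighbouring distinct values as thresholds, and the even thinning when there are too many candidates are all illustrative choices, not what findSplitsBins does or will do. The point is only that the returned list has no duplicates and may be shorter than requested, so the bin count has to be derived from it.

```python
def adaptive_splits(sample, max_splits):
    """Return at most max_splits unique thresholds for one continuous feature.

    Hypothetical helper for illustration only; not the MLlib implementation.
    """
    if max_splits <= 0:
        return []
    distinct = sorted(set(sample))
    # Assumed rule: a useful threshold separates two neighbouring distinct
    # values, so take the midpoint of each adjacent pair.
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    if len(candidates) <= max_splits:
        # Not enough unique candidates: return fewer splits than requested.
        return candidates
    # Too many candidates: keep an evenly spaced subset. The stride is
    # greater than 1, so the chosen indices are strictly increasing and
    # the result stays unique.
    stride = len(candidates) / max_splits
    return [candidates[int(i * stride)] for i in range(max_splits)]


skewed = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
few = adaptive_splits(skewed, 3)
print(few)           # [0.5, 1.5]: only two unique splits are available
print(len(few) + 1)  # 3 bins, derived from the adaptive split count

many = adaptive_splits(list(range(10)), 3)
print(many)          # [0.5, 3.5, 6.5]
```

Because the number of splits now depends on the data, any code that assumes a fixed count per feature (the issue names DecisionTreeMetadata) would need to read the per-feature count instead. Note that this sketch ignores how often each value occurs; a frequency-aware search would be a different design.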