[
https://issues.apache.org/jira/browse/SPARK-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631931#comment-14631931
]
Joseph K. Bradley commented on SPARK-9075:
------------------------------------------
I agree there are ways to deal with very high-arity categorical features, but
I think handling them is lower priority than some other improvements we're
working on (such as providing predicted class probabilities). In general, if
you have so few examples, you should drop the high-arity categorical feature.
It's true the check does not ensure all values are covered; that would be good
to refine in the future.
It sounds like we're discussing 3 possibilities, 2 short-term and 1 long-term:
* Run without exceptions no matter what is given.
** Short-term: Run as is. This could mean giving meaningless results.
** Long-term: We should implement a better way to handle many categories.
* Short-term: Throw exception and notify user of the problem. I prefer this
for now, until we can do the long-term solution.
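
As a rough illustration of the "throw exception and notify user" option, here
is a minimal standalone sketch. It is not MLlib code: the object name
ArityCheck, the method validate, and the wording of the message are all
hypothetical. Only the math.min cap and the require comparison follow the logic
quoted in the issue description.

```scala
// Standalone sketch, no Spark dependency. ArityCheck and validate are
// hypothetical names used only for illustration; they are not MLlib APIs.
object ArityCheck {

  /** Returns the bin cap, or throws IllegalArgumentException with a message
    * that names the offending categorical feature.
    *
    * @param categoricalArity feature index -> number of categories
    * @param maxBins          user-supplied bin limit (strategy.maxBins in the issue)
    * @param numExamples      number of training rows
    */
  def validate(categoricalArity: Map[Int, Int], maxBins: Int, numExamples: Long): Int = {
    // Same cap as described in the issue: never more bins than examples.
    val maxPossibleBins = math.min(maxBins.toLong, numExamples).toInt

    categoricalArity.foreach { case (feature, arity) =>
      require(
        arity <= maxPossibleBins,
        s"Categorical feature $feature has $arity categories, but only " +
          s"$maxPossibleBins bins are available " +
          s"(maxBins=$maxBins, numExamples=$numExamples). " +
          "Consider removing this feature or training on more examples."
      )
    }
    maxPossibleBins
  }
}
```

The point of the sketch is only that the message tells the user which feature
caused the failure and why, instead of failing on a bare comparison.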
> DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect.
> ----------------------------------------------------------------------------
>
> Key: SPARK-9075
> URL: https://issues.apache.org/jira/browse/SPARK-9075
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.4.0
> Reporter: Les Selecky
> Priority: Minor
>
> In
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
> there's a statement that sets maxPossibleBins to numExamples when
> numExamples is less than strategy.maxBins.
> This can cause an error when training small partitions; the error is
> triggered further down in the logic where it's required that
> maxCategoriesPerFeature be less than or equal to maxPossibleBins.
> Here's an example of how it manifested: the partition contained 49
> rows (i.e., numExamples=49), but strategy.maxBins was 57.
> The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore
> reduced maxPossibleBins to 49, causing the "require(maxCategoriesPerFeature <=
> maxPossibleBins)" check to throw an error.
> In short, this will be a problem when training small datasets with a feature
> that contains more categories than numExamples.
> In our local testing we commented out the "math.min(strategy.maxBins,
> numExamples)" line and the decision tree succeeded where it had failed
> previously.
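
The arithmetic in the report can be replayed with a small standalone sketch.
The values 49 and 57 come from the report; the category count of 50 is an
assumption chosen for illustration (the report does not state the actual
value), and the object name MaxBinsRepro and its helpers are hypothetical.

```scala
// Standalone sketch, no Spark dependency. MaxBinsRepro is a hypothetical name.
import scala.util.Try

object MaxBinsRepro {
  // The cap described in the report: bins are limited by the number of examples.
  def cappedBins(maxBins: Int, numExamples: Long): Int =
    math.min(maxBins.toLong, numExamples).toInt

  // The check described in the report; require throws IllegalArgumentException on failure.
  def check(maxCategoriesPerFeature: Int, maxPossibleBins: Int): Try[Unit] =
    Try(require(maxCategoriesPerFeature <= maxPossibleBins))

  def main(args: Array[String]): Unit = {
    val maxBins = 57                 // strategy.maxBins from the report
    val numExamples = 49L            // rows in the partition, from the report
    val maxCategoriesPerFeature = 50 // ASSUMED value, not given in the report

    val capped = cappedBins(maxBins, numExamples)
    println(s"with cap:    maxPossibleBins = $capped, check passes = " +
      check(maxCategoriesPerFeature, capped).isSuccess)

    // Equivalent to commenting out the math.min line, as in the local test.
    println(s"without cap: maxPossibleBins = $maxBins, check passes = " +
      check(maxCategoriesPerFeature, maxBins).isSuccess)
  }
}
```

Under these assumed numbers the check fails with the cap (50 > 49) and passes
without it (50 <= 57), which matches the behavior described above.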