[ https://issues.apache.org/jira/browse/SPARK-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628671#comment-14628671 ]

Joseph K. Bradley commented on SPARK-9075:
------------------------------------------

I agree it's not a bug, but it is a requirement which we could relax.  The 
requirement is pretty reasonable IMO.  If you have numExamples < 
maxCategoriesPerFeature, then you will know nothing about some of your feature 
values and will have to make arbitrary decisions about how to split that 
feature.
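
Roughly, with the numbers from the report below (49 examples, maxBins = 57, and a 
57-value categorical feature), the constraint that trips is the following. This is an 
illustrative sketch only, not the actual DecisionTreeMetadata source:

{code}
// Illustrative sketch only (not the actual DecisionTreeMetadata source),
// using the numbers from the report below.
val numExamples = 49L            // rows in the small partition
val maxBins = 57                 // strategy.maxBins
val maxCategoriesPerFeature = 57 // arity of the largest categorical feature

val maxPossibleBins = math.min(maxBins, numExamples).toInt  // 49

// With only 49 examples, at least 57 - 49 = 8 of the category values never
// occur in the data, so any split decision for them would be arbitrary.
require(maxCategoriesPerFeature <= maxPossibleBins)  // fails: 57 > 49
{code}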

Possible solutions include:
* Ignoring categorical features if we cannot collect enough info about them  
(This is not a usual practice in popular ML libraries, so I would not recommend 
it.)
* For high-arity categorical features, we could create a small number of bins 
which randomly group categories (a rough sketch of this idea follows the list).  
This would be a more reasonable solution, but I think it's low-priority since 
random groupings of many categories would provide very little information.
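
For illustration only, the random-grouping idea could look roughly like this; none 
of these names exist in MLlib:

{code}
import scala.util.hashing.MurmurHash3

// Hypothetical helper (not part of MLlib): deterministically map a high-arity
// category value into one of numBins pseudo-random groups.
def randomBin(categoryValue: Int, numBins: Int, seed: Int = 42): Int = {
  val h = MurmurHash3.productHash((categoryValue, seed))
  Math.floorMod(h, numBins)  // fold the hash into [0, numBins)
}

// E.g., collapse a 57-value feature into 8 coarse bins before training.
val binned = (0 until 57).map(c => c -> randomBin(c, numBins = 8))
{code}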

In the meantime, I'd recommend we improve the error message to tell the user to 
remove a feature X when there is too little data to make use of X.  I'll close 
this, but will create & link a JIRA for improving the error message.
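
For the error message, something along these lines might work; this is a sketch of 
the wording only, not a patch against Spark:

{code}
// Sketch of a more actionable check (wording only, not the actual Spark code).
def checkCategoricalArity(maxPossibleBins: Int,
                          categoricalFeaturesInfo: Map[Int, Int]): Unit = {
  categoricalFeaturesInfo.foreach { case (featureIndex, arity) =>
    require(arity <= maxPossibleBins,
      s"Categorical feature $featureIndex has $arity values, but only " +
      s"$maxPossibleBins bins are available (the smaller of maxBins and the " +
      s"number of training examples). Remove feature $featureIndex, reduce " +
      "its number of categories, or train on more examples.")
  }
}

// Would fail with the message above for the case in this issue
// (feature index 0 is arbitrary):
// checkCategoricalArity(maxPossibleBins = 49, Map(0 -> 57))
{code}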

> DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect. 
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-9075
>                 URL: https://issues.apache.org/jira/browse/SPARK-9075
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.4.0
>            Reporter: Les Selecky
>            Priority: Minor
>
> In 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
>  there's a statement that sets maxPossibleBins to numExamples when 
> numExamples is less than strategy.maxBins. 
> This can cause an error when training small partitions; the error is 
> triggered further down in the logic where it's required that 
> maxCategoriesPerFeature be less than or equal to maxPossibleBins.
> Here's an example of how it manifested: the partition contained 49 rows 
> (i.e., numExamples=49), but strategy.maxBins was 57.
> The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore 
> reduced maxPossibleBins to 49, causing the "require(maxCategoriesPerFeature <= 
> maxPossibleBins)" check to throw an error.
> In short, this will be a problem when training small datasets with a feature 
> that contains more categories than numExamples.
> In our local testing we commented out the "math.min(strategy.maxBins, 
> numExamples)" line and the decision tree succeeded where it had failed 
> previously.


