[ https://issues.apache.org/jira/browse/SPARK-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631605#comment-14631605 ]

Neetu Verma commented on SPARK-9075:
------------------------------------

Thanks for your response. We agree that most ML libraries recommend against
ignoring missing values. However, for decision trees there are known
techniques for handling examples with missing categorical features: assigning
a weight to each value according to its frequency among all of the examples,
replacing missing values with a common value, and pruning are a few examples.
We referred to Russell and Norvig's book "Artificial Intelligence: A Modern
Approach" on this. We understand that the results may be ambiguous when
categorical values are missing, but we don't think training should fail with
an exception.
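As an illustration of the second technique, here is a minimal standalone
sketch of common-value (mode) imputation for a single categorical feature.
This is not MLlib code, and the use of -1.0 as a "missing" sentinel is purely
an assumption for the sketch:

    // Replace missing entries of one categorical feature with the most
    // frequent non-missing value. The -1.0 sentinel for "missing" is an
    // assumption for this sketch, not how MLlib encodes missing values.
    def imputeWithMode(values: Seq[Double], missing: Double = -1.0): Seq[Double] = {
      val mode = values
        .filter(_ != missing)         // drop missing entries
        .groupBy(identity)            // bucket occurrences per category
        .maxBy(_._2.size)._1          // pick the most frequent category
      values.map(v => if (v == missing) mode else v)
    }

    // Example: imputeWithMode(Seq(2.0, -1.0, 2.0, 1.0)) == Seq(2.0, 2.0, 2.0, 1.0)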
Also, to further elaborate: the issue we are referring to is where the code
computes the maximum possible number of bins from the configured maximum and
the number of examples. That computation does not verify that every distinct
value of each categorical feature actually occurs in the examples; it is just
a quick sanity check and does not ensure all values are covered.
In a scenario where a categorical feature has 50 values and the number of
examples is 100, no exception is thrown, even though some of the feature's
categorical values may still be absent from the examples. But in the scenario
where the feature has 50 values and the number of examples is 49, an
exception is thrown. In both scenarios there can be missing categorical
values, so the check below is not guaranteed to catch the situation where
values are missing from the examples.

        maxPossibleBins = math.min(strategy.maxBins, numExamples)
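
To make the two scenarios concrete, here is a minimal standalone sketch of
the check as we understand it (the method name and error message are ours,
not the actual DecisionTreeMetadata code):

    // Simplified model of the bin-count check: maxPossibleBins is capped by
    // the number of examples, and each categorical feature must fit in it.
    def checkBins(maxBins: Int, numExamples: Long, maxCategoriesPerFeature: Int): Unit = {
      val maxPossibleBins = math.min(maxBins.toLong, numExamples).toInt
      require(maxCategoriesPerFeature <= maxPossibleBins,
        s"maxCategoriesPerFeature=$maxCategoriesPerFeature exceeds maxPossibleBins=$maxPossibleBins")
    }

    checkBins(maxBins = 57, numExamples = 100, maxCategoriesPerFeature = 50) // passes
    checkBins(maxBins = 57, numExamples = 49,  maxCategoriesPerFeature = 50) // throws
    // Neither call verifies that all 50 category values actually occur in the data.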

Please let us know your thoughts.

> DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect. 
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-9075
>                 URL: https://issues.apache.org/jira/browse/SPARK-9075
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.4.0
>            Reporter: Les Selecky
>            Priority: Minor
>
> In 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
> there's a statement that sets maxPossibleBins to numExamples when
> numExamples is less than strategy.maxBins.
> This can cause an error when training small partitions; the error is 
> triggered further down in the logic where it's required that 
> maxCategoriesPerFeature be less than or equal to maxPossibleBins.
> Here's an example of how it manifested: the partition contained 49 rows
> (i.e., numExamples=49) but strategy.maxBins was 57.
> The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore
> reduced maxPossibleBins to 49, causing the "require(maxCategoriesPerFeature <=
> maxPossibleBins)" check to throw an error.
> In short, this will be a problem when training on small datasets with a
> feature that contains more categories than there are examples.
> In our local testing we commented out the "math.min(strategy.maxBins,
> numExamples)" line and the decision tree succeeded where it had previously
> failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
