[
https://issues.apache.org/jira/browse/SPARK-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631605#comment-14631605
]
Neetu Verma commented on SPARK-9075:
------------------------------------
Thanks for your response. We agree that most ML libraries do not recommend
ignoring missing values. However, for decision trees there are known
techniques for handling examples with missing categorical feature values:
assigning a weight to each value according to its frequency among all of the
examples, replacing missing values with a common value, and pruning are a few
examples. We referred to Russell and Norvig's book "Artificial Intelligence:
A Modern Approach" to read up on this. We understand that the results might
be ambiguous with missing categorical values, but we don't think the code
should fail with an exception. To elaborate further, the issue we are
referring to is that the code computes the maximum possible bins from the
maximum number of bins and the number of examples. This check does not verify
that the distinct values of each categorical feature actually appear in the
examples; it is only a quick size comparison, so it does not ensure all
values are covered.
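As a rough illustration of the "replace missing values with a common value"
technique mentioned above, here is a minimal standalone sketch in Scala
(hypothetical code, not part of Spark's MLlib; the names are ours):

// Hypothetical sketch of mode imputation for one categorical column;
// missing entries are encoded as None.
object ModeImputation {
  def imputeWithMode(column: Seq[Option[Int]]): Seq[Int] = {
    val observed = column.flatten
    require(observed.nonEmpty, "cannot impute a column with no observed values")
    // Pick the most frequent observed category (the mode).
    val mode = observed.groupBy(identity).maxBy(_._2.size)._1
    column.map(_.getOrElse(mode))
  }

  def main(args: Array[String]): Unit = {
    val col = Seq(Some(2), None, Some(2), Some(1), None)
    println(imputeWithMode(col)) // List(2, 2, 2, 1, 2)
  }
}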
In a scenario where a categorical feature has 50 categories and there are 100
examples, no exception is thrown, even though some of the feature's category
values may still be absent from the examples. But when the feature has 50
categories and there are only 49 examples, an exception is thrown. In both
scenarios category values can be missing, so the check below is not
guaranteed to catch missing values in the examples.
maxPossibleBins = math.min(strategy.maxBins, numExamples)
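To make the distinction concrete, here is a rough standalone sketch
(hypothetical code, not Spark's DecisionTreeMetadata; the names are ours)
contrasting the count-based check above with a check that actually verifies
coverage of the declared category values:

// The existing style of check compares sizes only, so it says nothing
// about which category values actually occur in the data.
object BinCheckSketch {
  def countBasedCheck(maxBins: Int, numExamples: Long, maxCategories: Int): Boolean = {
    val maxPossibleBins = math.min(maxBins, numExamples).toInt
    maxCategories <= maxPossibleBins
  }

  // A coverage check: does every declared category value of a feature
  // actually appear among the examples?
  def coverageCheck(declaredCategories: Int, featureValues: Seq[Int]): Boolean =
    featureValues.distinct.size == declaredCategories

  def main(args: Array[String]): Unit = {
    // 50 categories, 100 examples: the count check passes even though the
    // data here contains only one distinct value.
    println(countBasedCheck(maxBins = 57, numExamples = 100, maxCategories = 50)) // true
    println(coverageCheck(declaredCategories = 50, featureValues = Seq.fill(100)(0))) // false

    // 50 categories, 49 examples: the count check fails, although the
    // underlying data issue (missing category values) is the same kind.
    println(countBasedCheck(maxBins = 57, numExamples = 49, maxCategories = 50)) // false
  }
}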
Please let us know your thoughts.
> DecisionTreeMetadata - setting maxPossibleBins to numExamples is incorrect.
> ----------------------------------------------------------------------------
>
> Key: SPARK-9075
> URL: https://issues.apache.org/jira/browse/SPARK-9075
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.4.0
> Reporter: Les Selecky
> Priority: Minor
>
> In
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
> there's a statement that sets maxPossibleBins to numExamples when
> numExamples is less than strategy.maxBins.
> This can cause an error when training small partitions; the error is
> triggered further down in the logic where it's required that
> maxCategoriesPerFeature be less than or equal to maxPossibleBins.
> Here's an example of how it manifested: the partition contained 49 rows
> (i.e., numExamples=49) but strategy.maxBins was 57.
> The maxPossibleBins = math.min(strategy.maxBins, numExamples) logic therefore
> reduced maxPossibleBins to 49, causing the require(maxCategoriesPerFeature <=
> maxPossibleBins) check to throw an error.
> In short, this will be a problem when training small datasets with a feature
> that contains more categories than numExamples.
> In our local testing we commented out the "math.min(strategy.maxBins,
> numExamples)" line and the decision tree succeeded where it had failed
> previously.
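For completeness, the failure described in the report can be reproduced in
isolation; this is a hedged sketch whose identifiers mirror the report (the
value maxCategoriesPerFeature = 50 is assumed, since any value above 49
triggers the check):

// Standalone reproduction of the reported failure mode; hypothetical code,
// not the actual DecisionTreeMetadata source.
object Spark9075Repro {
  def main(args: Array[String]): Unit = {
    val numExamples = 49L            // rows in the small partition (from the report)
    val maxBins = 57                 // strategy.maxBins (from the report)
    val maxCategoriesPerFeature = 50 // assumed; any value > 49 reproduces it

    val maxPossibleBins = math.min(maxBins, numExamples).toInt // min(57, 49) = 49
    // Mirrors the failing check: 50 <= 49 is false, so this throws
    // java.lang.IllegalArgumentException.
    require(maxCategoriesPerFeature <= maxPossibleBins,
      s"maxCategoriesPerFeature=$maxCategoriesPerFeature > maxPossibleBins=$maxPossibleBins")
  }
}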