[ 
https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-2756:
-------------------------------------

    Description: 
3 bugs:

Bug 1: Indexing is inconsistent for aggregate calculations for unordered 
features (in multiclass classification with categorical features, where the 
features had few enough values such that they could be considered unordered, 
i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, 
binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).
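
To illustrate why the mismatch in Bug 1 matters, here is a minimal sketch. It assumes the aggregates live in one flat array addressed with row-major strides. The helper flat_index, the array sizes, and the variable names are hypothetical and are not taken from the MLlib source. The same two trailing slots are filled once as (binIndex, label) and once as (featureValue, binIndex), and the resulting offsets disagree.

```python
def flat_index(node, feature, bin_index, label,
               num_features, num_bins, num_classes):
    """Offset into a flat array laid out as (node, feature, binIndex, label)."""
    return ((node * num_features + feature) * num_bins + bin_index) * num_classes + label


# Hypothetical sizes, chosen only for illustration.
NUM_FEATURES, NUM_BINS, NUM_CLASSES = 2, 4, 3

# Reader's view: the last two coordinates are (binIndex, label).
read_offset = flat_index(0, 1, 2, 1, NUM_FEATURES, NUM_BINS, NUM_CLASSES)

# Writer's view under Bug 1: the same two slots receive (featureValue, binIndex).
feature_value, bin_index = 1, 2
write_offset = flat_index(0, 1, feature_value, bin_index,
                          NUM_FEATURES, NUM_BINS, NUM_CLASSES)

# The offsets differ, so the update lands in a cell that the reader does not
# associate with (binIndex=2, label=1).
print(read_offset, write_offset)
```

Both sides have to agree on one coordinate order before any statistics computed from agg can be trusted.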

Bug 2: calculateGainForSplit (for classification):
* It returned dummy prediction values when either the left or right child had 
0 weight.  These are incorrect for multiclass classification.
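
One possible way to avoid a fixed dummy prediction is sketched below. It assumes that each child carries a list of per-class weights and that an empty child may fall back to its parent's majority class. Both function names are hypothetical, and this is not necessarily the change made for this issue.

```python
def majority_class(class_weights):
    """Index of the heaviest class; ties resolve to the lowest index."""
    return max(range(len(class_weights)), key=lambda k: class_weights[k])


def child_prediction(child_weights, parent_weights):
    """Predict for a child node, falling back to the parent if it has 0 weight.

    The fallback is an assumption made for this sketch. A hard-coded dummy
    label cannot be right in general once there are more than two classes.
    """
    if sum(child_weights) == 0:
        return majority_class(parent_weights)
    return majority_class(child_weights)


# Three classes: the empty left child inherits the parent's majority class (2).
print(child_prediction([0, 0, 0], [1, 2, 7]))
# The non-empty right child uses its own counts.
print(child_prediction([0, 5, 1], [1, 2, 7]))
```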

Bug 3: Off-by-1 when finding thresholds for splits for continuous features.
* When finding thresholds for possible splits for continuous features in 
DecisionTree.findSplitsBins, the thresholds were set according to individual 
training examples’ feature values.  This can cause problems for small datasets.
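
The sketch below shows one way a threshold taken directly from a training example can misbehave on a tiny dataset. It assumes the convention "value <= threshold goes left", which is an assumption and not confirmed by this issue. The midpoint rule is a common alternative shown for contrast, not necessarily the fix applied in DecisionTree.findSplitsBins. The function names are hypothetical.

```python
def split(values, threshold):
    """Partition values with the assumed rule: value <= threshold goes left."""
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    return left, right


def midpoint_thresholds(values):
    """Candidate thresholds halfway between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]


samples = [3.0, 1.0, 2.0]

# Using the largest observed value as a threshold leaves the right side empty.
print(split(samples, max(samples)))

# Midpoints keep both sides non-empty for every candidate.
for t in midpoint_thresholds(samples):
    print(t, split(samples, t))
```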


  was:
2 bugs:

Bug 1: Indexing is inconsistent for aggregate calculations for unordered 
features (in multiclass classification with categorical features, where the 
features had few enough values such that they could be considered unordered, 
i.e., isSpaceSufficientForAllCategoricalSplits=true).

* updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, 
binIndex), where
** featureValue was from arr (so it was a feature value)
** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
* The rest of the code indexed agg as (node, feature, binIndex, label).

Bug 2: calculateGainForSplit (for classification):
* It returns dummy prediction values when either the right or left children had 
0 weight.  These are incorrect for multiclass classification.



> Decision Tree bugs
> ------------------
>
>                 Key: SPARK-2756
>                 URL: https://issues.apache.org/jira/browse/SPARK-2756
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>
> 3 bugs:
> Bug 1: Indexing is inconsistent for aggregate calculations for unordered 
> features (in multiclass classification with categorical features, where the 
> features had few enough values such that they could be considered unordered, 
> i.e., isSpaceSufficientForAllCategoricalSplits=true).
> * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, 
> binIndex), where
> ** featureValue was from arr (so it was a feature value)
> ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1)
> * The rest of the code indexed agg as (node, feature, binIndex, label).
> Bug 2: calculateGainForSplit (for classification):
> * It returned dummy prediction values when either the left or right child 
> had 0 weight.  These are incorrect for multiclass classification.
> Bug 3: Off-by-1 when finding thresholds for splits for continuous features.
> * When finding thresholds for possible splits for continuous features in 
> DecisionTree.findSplitsBins, the thresholds were set according to individual 
> training examples’ feature values.  This can cause problems for small 
> datasets.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
