I'm looking at this bit of code in DecisionTreeMetadata ...
val maxCategoriesForUnorderedFeature =
  ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt
strategy.categoricalFeaturesInfo.foreach { case (featureIndex, numCategories) =>
  // Decide if some categorical features should be treated as unordered features,
  // which require 2 * ((1 << numCategories - 1) - 1) bins.
  // We do this check with log values to prevent overflows in case numCategories is large.
  // The next check is equivalent to: 2 * ((1 << numCategories - 1) - 1) <= maxBins
  if (numCategories <= maxCategoriesForUnorderedFeature) {
    unorderedFeatures.add(featureIndex)
    numBins(featureIndex) = numUnorderedBins(numCategories)
  } else {
    numBins(featureIndex) = numCategories
  }
}
So if I have a feature with 40 values and fewer than about a trillion
bins (2 * (2^39 - 1), to be exact), it gets treated as a continuous
feature, which is meaningless. It throws an exception shortly afterwards,
since other parts of the code expect it to be a categorical feature.
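To make the numbers concrete, here is a quick standalone sketch of the
same arithmetic (assuming maxBins = 32, which I believe is the default;
the names are mine, not the ones in DecisionTreeMetadata):

object BinMath {
  // Largest numCategories for which a feature can be unordered,
  // mirroring the threshold computed in the snippet above.
  def maxCategoriesForUnordered(maxPossibleBins: Int): Int =
    ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt

  // Bins an unordered feature needs: 2 * (2^(numCategories - 1) - 1).
  // Long arithmetic, since Int overflows past 32 categories.
  def unorderedBins(numCategories: Int): Long =
    2L * ((1L << (numCategories - 1)) - 1L)

  def main(args: Array[String]): Unit = {
    println(maxCategoriesForUnordered(32)) // 5
    println(unorderedBins(40))             // 1099511627774 (~1.1 trillion)
  }
}

So with the defaults, only features with 5 or fewer categories take the
unordered path; everything above that silently falls into the else branch.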
I think there's a bug here somewhere, but I'm not sure whether this is
just 'not implemented yet', and so needs a better error message (and
should eventually be implemented), or whether something else is
preventing it from working as expected.
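For what it's worth, the heuristic Evan mentions below is, as I
understand it, the classic trick of sorting categories by their average
response and then treating them as ordered (Breiman et al. show this
gives the optimal split for regression and binary classification). A
minimal sketch of the idea, my own illustration rather than anything in
MLlib:

// labeled: (categoryId, label) pairs for the points reaching a node.
def orderedCategories(labeled: Seq[(Int, Double)]): Seq[Int] =
  labeled
    .groupBy(_._1)                          // points per category
    .map { case (cat, ps) =>
      (cat, ps.map(_._2).sum / ps.size)     // mean label per category
    }
    .toSeq
    .sortBy(_._2)                           // order categories by mean label
    .map(_._1)

// The N categories, once sorted this way, can be split like an ordered
// feature: N - 1 candidate splits instead of 2^(N-1) - 1.

That would sidestep the exponential bin requirement for high-cardinality
features.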
I'll wait a beat to get more info and then if needed open a JIRA. Thanks all.
On Mon, Oct 13, 2014 at 3:34 AM, Evan Sparks <[email protected]> wrote:
> I was under the impression that we were using the usual sort by average
> response value heuristic when storing histogram bins (and searching for
> optimal splits) in the tree code.
>
> Maybe Manish or Joseph can clarify?
>
>> On Oct 12, 2014, at 2:50 PM, Sean Owen <[email protected]> wrote:
>>
>> I'm having trouble getting decision forests to work with categorical
>> features. I have a dataset with a categorical feature with 40 values.
>> It seems to be treated as a continuous/numeric value by the
>> implementation.
>>
>> Digging deeper, I see there is some logic in the code that indicates
>> that categorical features with N values do not work unless the number
>> of bins is at least 2 * (2^(N-1) - 1). I understand this as the
>> naive brute force condition, wherein the decision tree will test all
>> possible splits of the categorical value.
>>
>> However, this quickly becomes unusable, since the number of bins should
>> realistically be tens or hundreds at most, and this requirement rules out
>> categorical features with more than 10 or so values. But, of course,
>> it's not unusual to have categorical features with high cardinality.
>> It's almost common.
>>
>> There are some pretty good heuristics for selecting 'bins' over
>> categorical features when the number of bins is far fewer than the
>> complete, exhaustive set.
>>
>> Before I open a JIRA or continue, does anyone know what I am talking
>> about, or am I mistaken? Is this a real limitation, and is it worth
>> pursuing these heuristics? I can't figure out how to proceed with
>> decision forests in MLlib otherwise.