I'm looking at this bit of code in DecisionTreeMetadata:

    val maxCategoriesForUnorderedFeature =
      ((math.log(maxPossibleBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt
    strategy.categoricalFeaturesInfo.foreach { case (featureIndex, numCategories) =>
      // Decide if some categorical features should be treated as unordered features,
      // which require 2 * ((1 << numCategories - 1) - 1) bins.
      // We do this check with log values to prevent overflows in case numCategories is large.
      // The next check is equivalent to: 2 * ((1 << numCategories - 1) - 1) <= maxBins
      if (numCategories <= maxCategoriesForUnorderedFeature) {
        unorderedFeatures.add(featureIndex)
        numBins(featureIndex) = numUnorderedBins(numCategories)
      } else {
        numBins(featureIndex) = numCategories
      }
    }
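To make the arithmetic concrete, here is a small standalone sketch (not the Spark code itself; the names just mirror DecisionTreeMetadata) of the two quantities above. It shows why 40 categories blows up: treating a feature with k values as unordered needs 2 * (2^(k-1) - 1) bins, which for k = 40 is about 1.1 trillion, while the log-based check caps k at a handful for any sane maxBins.

```scala
// Standalone sketch of the bin arithmetic discussed above.
// Names mirror DecisionTreeMetadata for readability; this is not Spark code.
object BinMath {
  // Bins needed to treat a categorical feature with k values as unordered:
  // 2 * (number of nonempty proper "left" subsets) = 2 * (2^(k-1) - 1).
  def numUnorderedBins(k: Int): Long = 2L * ((1L << (k - 1)) - 1)

  // Largest k for which numUnorderedBins(k) <= maxBins, computed with logs
  // to avoid overflow for large k, matching the check quoted above.
  def maxCategoriesForUnordered(maxBins: Int): Int =
    ((math.log(maxBins / 2 + 1) / math.log(2.0)) + 1).floor.toInt

  def main(args: Array[String]): Unit = {
    // With the default-ish maxBins = 32, only ~5 categories fit.
    println(maxCategoriesForUnordered(32))
    // 40 categories would need on the order of 1.1e12 bins.
    println(numUnorderedBins(40))
  }
}
```

Running this shows how quickly the unordered treatment becomes infeasible, which is why a feature with 40 values falls through to the else branch.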
So if I have a feature with 40 values and anything less than about a trillion bins, it gets treated as a continuous feature, which is meaningless. It shortly throws an exception, though, since other parts of the code expect this to be a categorical feature. I think there's a bug here somewhere, but I wasn't sure whether it's just not implemented yet and so needs a better error message (and should be implemented), or whether something else prevents this from working as expected. I'll wait a beat to get more info and then open a JIRA if needed. Thanks all.

On Mon, Oct 13, 2014 at 3:34 AM, Evan Sparks <evan.spa...@gmail.com> wrote:
> I was under the impression that we were using the usual sort-by-average-
> response-value heuristic when storing histogram bins (and searching for
> optimal splits) in the tree code.
>
> Maybe Manish or Joseph can clarify?
>
>> On Oct 12, 2014, at 2:50 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> I'm having trouble getting decision forests to work with categorical
>> features. I have a dataset with a categorical feature with 40 values.
>> It seems to be treated as a continuous/numeric value by the
>> implementation.
>>
>> Digging deeper, I see there is some logic in the code indicating
>> that categorical features with more than N values do not work unless
>> the number of bins is at least 2 * (2^(N-1) - 1). I understand this as
>> the naive brute-force condition, under which the decision tree tests
>> all possible splits of the categorical value.
>>
>> However, this gets unusable quickly, since the number of bins should
>> be tens or hundreds at most, and this requirement therefore rules out
>> categorical features with more than 10 or so values. But, of course,
>> it's not unusual to have categorical features with high cardinality.
>> It's almost common.
>>
>> There are some pretty fine heuristics for selecting 'bins' over
>> categorical features when the number of bins is far fewer than the
>> complete, exhaustive set.
>>
>> Before I open a JIRA or continue, does anyone know what I am talking
>> about, or am I mistaken? Is this a real limitation, and is it worth
>> pursuing these heuristics? I can't figure out how to proceed with
>> decision forests in MLlib otherwise.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>> ---------------------------------------------------------------------
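For reference, the "sort by average response value" heuristic mentioned in the thread can be sketched as follows. This is a hypothetical standalone illustration, not MLlib's actual implementation: for regression (and, by a classical result, binary classification), sorting the K category values by their mean label and considering only the K-1 "prefix" splits of that ordering suffices, instead of enumerating all 2^(K-1) - 1 subsets.

```scala
// Hedged sketch of the sort-by-average-response heuristic, not Spark code.
// Input: (categoryId, label) pairs. Output: the K-1 candidate left-child
// category sets obtained by sorting categories by mean label and taking
// prefixes of that ordering.
object SortedCategorySplits {
  def candidateSplits(points: Seq[(Int, Double)]): Seq[Set[Int]] = {
    // Mean label per category.
    val meanByCategory = points
      .groupBy(_._1)
      .map { case (cat, pts) => cat -> pts.map(_._2).sum / pts.size }
    // Categories ordered by increasing mean label.
    val ordered = meanByCategory.toSeq.sortBy(_._2).map(_._1)
    // Only the K-1 contiguous prefixes need to be evaluated as splits.
    (1 until ordered.size).map(i => ordered.take(i).toSet)
  }

  def main(args: Array[String]): Unit = {
    // Toy data: means are cat0 = 1.0, cat1 = 0.0, cat2 = 0.5, cat3 = 0.2,
    // so the sorted order is 1, 3, 2, 0 and there are 3 candidate splits.
    val data = Seq((0, 1.0), (0, 1.0), (1, 0.0), (2, 0.0), (2, 1.0),
                   (3, 0.2), (3, 0.2))
    SortedCategorySplits.candidateSplits(data).foreach(println)
  }
}
```

With 40 categories this evaluates 39 candidate splits rather than roughly 5.5e11, which is the kind of reduction the thread is asking about.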