I was under the impression that we were using the usual
sort-by-average-response-value heuristic when storing histogram bins
(and searching for optimal splits) in the tree code.

Maybe Manish or Joseph can clarify?
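
For context, the standard trick for regression and binary classification
(going back to CART / Breiman et al.) is to sort the category values by
their average response and then evaluate only the N - 1 contiguous splits
of that ordering, which provably contains an optimal subset split; for
multiclass targets an analogous ordering is only a heuristic. Below is a
minimal sketch of the idea, purely illustrative and not Spark's actual
code; the object and method names (CategoricalSplitSketch, bestSplit) and
the per-category data layout are my own assumptions:

    // Sketch of the sort-by-average-response heuristic for categorical
    // splits (regression / binary classification). Assumes per-category
    // sufficient statistics have already been aggregated and that every
    // category appears at least once (count > 0).
    object CategoricalSplitSketch {
      // stats: one (categoryId, labelSum, count) triple per category.
      // Returns the best left-side category subset and its score.
      def bestSplit(stats: Seq[(Int, Double, Long)]): (Set[Int], Double) = {
        // Order categories by mean response value.
        val ordered = stats.sortBy { case (_, sum, n) => sum / n }
        val totalSum = stats.map(_._2).sum
        val totalCnt = stats.map(_._3).sum
        var leftSum = 0.0
        var leftCnt = 0L
        var bestSet = Set.empty[Int]
        var bestScore = Double.NegativeInfinity
        // Only the N - 1 contiguous splits of the sorted order need testing.
        for (i <- 0 until ordered.size - 1) {
          leftSum += ordered(i)._2
          leftCnt += ordered(i)._3
          val rightSum = totalSum - leftSum
          val rightCnt = totalCnt - leftCnt
          // Maximizing sum(s^2 / n) over the two sides is equivalent to
          // minimizing squared error for a regression split.
          val score = leftSum * leftSum / leftCnt +
            rightSum * rightSum / rightCnt
          if (score > bestScore) {
            bestScore = score
            bestSet = ordered.take(i + 1).map(_._1).toSet
          }
        }
        (bestSet, bestScore)
      }
    }

For example, bestSplit(Seq((0, 3.0, 4L), (1, 9.0, 4L), (2, 1.0, 4L)))
groups categories 2 and 0 against category 1, since their mean responses
(0.25 and 0.75) both fall below 2.25.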

> On Oct 12, 2014, at 2:50 PM, Sean Owen <so...@cloudera.com> wrote:
> 
> I'm having trouble getting decision forests to work with categorical
> features. I have a dataset with a categorical feature with 40 values.
> It seems to be treated as a continuous/numeric value by the
> implementation.
> 
> Digging deeper, I see there is some logic in the code indicating
> that a categorical feature with N values does not work unless the
> number of bins is at least 2 * (2^(N-1) - 1). I understand this as the
> naive brute-force condition, wherein the decision tree tests all
> possible subset splits of the categorical feature.
> 
> However, this becomes unusable quickly, since the number of bins
> should be at most tens or hundreds, and this requirement therefore
> rules out categorical features with more than about 10 values. But,
> of course, it's not unusual to have categorical features with high
> cardinality. It's almost common.
> 
> There are some pretty good heuristics for selecting 'bins' over
> categorical features when the number of bins is far smaller than the
> size of the complete, exhaustive set.
> 
> Before I open a JIRA or continue: does anyone know what I am talking
> about, or am I mistaken? Is this a real limitation, and is it worth
> pursuing these heuristics? I can't figure out how to proceed with
> decision forests in MLlib otherwise.
> 
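
To make the numbers above concrete: with N = 40 category values,
exhaustive subset splitting means 2^39 - 1 (about 5.5 * 10^11) candidate
splits, i.e. on the order of 10^12 bins under the condition Sean quotes,
versus just 39 candidate splits once the categories are sorted by average
response (with the caveat that the optimality guarantee holds only for
regression and binary classification).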

