GitHub user iyerr3 opened a pull request:
https://github.com/apache/madlib/pull/301
DT/RF: Don't eliminate single-level categorical variable
JIRA: MADLIB-1258
When DT/RF is run with grouping, a categorical variable could be
eliminated in a subset of the groups (those where it has only one
level), leading to multiple issues downstream, including invalid
importance values and incorrect predictions.
This commit keeps all categorical variables, even those that contain
just one level. This leads to some inefficiency during tree training,
since the accumulator state uses additional space for such a
categorical variable even though it is never used in a tree. This
inefficiency is accepted in exchange for cleaner code and error-free
prediction/importance reporting.
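The effect described above can be illustrated with a small,
self-contained Python sketch. This is not the MADlib implementation:
the helper names (levels_per_group, features_used), the
drop_single_level flag, and the toy data are all hypothetical and only
show why dropping a single-level variable in some groups makes feature
positions inconsistent across groups.

```python
from collections import defaultdict


def levels_per_group(rows, group_col, cat_cols):
    """Collect the distinct levels of each categorical column per group."""
    levels = defaultdict(lambda: {c: set() for c in cat_cols})
    for row in rows:
        for c in cat_cols:
            levels[row[group_col]][c].add(row[c])
    return levels


def features_used(levels, cat_cols, drop_single_level):
    """Return, per group, the ordered categorical features kept for training.

    drop_single_level=True mimics the behavior the commit removes
    (hypothetical flag, for illustration only).
    """
    kept = {}
    for group, col_levels in levels.items():
        kept[group] = [
            c for c in cat_cols
            if not (drop_single_level and len(col_levels[c]) <= 1)
        ]
    return kept


# Toy data: 'color' has two levels in group 'a' but only one in group 'b'.
rows = [
    {"grp": "a", "color": "red", "size": "S"},
    {"grp": "a", "color": "blue", "size": "M"},
    {"grp": "b", "color": "red", "size": "S"},
    {"grp": "b", "color": "red", "size": "L"},
]
cat_cols = ["color", "size"]
levels = levels_per_group(rows, "grp", cat_cols)

# Dropping single-level variables: feature index 0 means 'color' in
# group 'a' but 'size' in group 'b', so per-feature importance values
# and prediction-time lookups no longer line up across groups.
old = features_used(levels, cat_cols, drop_single_level=True)

# Retaining every variable: both groups share one feature layout, at
# the cost of unused space for 'color' in group 'b'.
new = features_used(levels, cat_cols, drop_single_level=False)
```

In this sketch, old maps group 'a' to both features and group 'b' to
'size' only, while new gives every group the same ordered feature
list, which is the property the importance and prediction code relies
on.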
Closes #301
Co-authored-by: Nandish Jayaram <[email protected]>
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib bugfix/dt_retain_cat_features
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/301.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #301
----
commit 089a4e2162a7b92dd288e5518ca8710f0aeac696
Author: Rahul Iyer <riyer@...>
Date: 2018-07-26T19:17:58Z
DT/RF: Don't eliminate single-level categorical variable
JIRA: MADLIB-1258
When DT/RF is run with grouping, a categorical variable could be
eliminated in a subset of the groups (those where it has only one
level), leading to multiple issues downstream, including invalid
importance values and incorrect predictions.
This commit keeps all categorical variables, even those that contain
just one level. This leads to some inefficiency during tree training,
since the accumulator state uses additional space for such a
categorical variable even though it is never used in a tree. This
inefficiency is accepted in exchange for cleaner code and error-free
prediction/importance reporting.
Closes #301
Co-authored-by: Nandish Jayaram <[email protected]>
----