[GitHub] madlib pull request #301: DT/RF: Don't eliminate single-level categorical va...

iyerr3 Thu, 26 Jul 2018 16:50:38 -0700

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/madlib/pull/301


    DT/RF: Don't eliminate single-level categorical variable

    JIRA: MADLIB-1258
    
    When DT/RF is run with grouping, some of the groups could eliminate a
    categorical variable (one that has only a single level within that
    group), leading to multiple issues downstream, including invalid
    importance values and incorrect predictions.
    
    This commit keeps all categorical variables, even those that contain
    just one level. This leads to some inefficiency during tree training,
    since the accumulator state reserves additional space for such a
    variable but never uses it in a tree. This inefficiency is an
    acceptable trade-off for cleaner code and error-free
    prediction/importance reporting.
    
    Closes #301
    
    Co-authored-by: Nandish Jayaram <[email protected]>
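
The effect described above can be illustrated with a small, self-contained
sketch. This is not MADlib's code: the helper names (levels_per_group,
select_features), the keep_single_level flag, and the toy rows are all
hypothetical, chosen only to show how dropping single-level categorical
variables per group can leave different groups with different feature lists,
while keeping them gives every group the same layout.

```python
from collections import defaultdict


def levels_per_group(rows, group_key, cat_features):
    """Collect the sorted distinct levels of each categorical feature, per group."""
    seen = defaultdict(lambda: {f: set() for f in cat_features})
    for row in rows:
        for f in cat_features:
            seen[row[group_key]][f].add(row[f])
    return {g: {f: sorted(v) for f, v in feats.items()} for g, feats in seen.items()}


def select_features(levels, keep_single_level):
    """Return, per group, the ordered list of categorical features to train on.

    keep_single_level=False mimics the behaviour the commit message describes
    as problematic; keep_single_level=True mimics the fix (hypothetical flag).
    """
    return {
        g: [f for f, lv in feats.items() if keep_single_level or len(lv) > 1]
        for g, feats in levels.items()
    }


# Toy data: within group "b", the feature "color" has only one level ("red").
rows = [
    {"grp": "a", "color": "red", "size": "S"},
    {"grp": "a", "color": "blue", "size": "M"},
    {"grp": "b", "color": "red", "size": "S"},
    {"grp": "b", "color": "red", "size": "L"},
]

levels = levels_per_group(rows, "grp", ["color", "size"])
dropped = select_features(levels, keep_single_level=False)
kept = select_features(levels, keep_single_level=True)

print(dropped)  # group "b" loses "color", so feature positions differ by group
print(kept)     # every group keeps the same ordered feature list
```

With the single-level variable dropped, any per-feature output indexed by
position (such as importance values) would no longer line up across groups;
keeping it costs a little unused space but keeps the layout uniform.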

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib bugfix/dt_retain_cat_features

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/301.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #301
    
----
commit 089a4e2162a7b92dd288e5518ca8710f0aeac696
Author: Rahul Iyer <riyer@...>
Date:   2018-07-26T19:17:58Z

    DT/RF: Don't eliminate single-level categorical variable
    
    JIRA: MADLIB-1258
    
    When DT/RF is run with grouping, some of the groups could eliminate a
    categorical variable (one that has only a single level within that
    group), leading to multiple issues downstream, including invalid
    importance values and incorrect predictions.
    
    This commit keeps all categorical variables, even those that contain
    just one level. This leads to some inefficiency during tree training,
    since the accumulator state reserves additional space for such a
    variable but never uses it in a tree. This inefficiency is an
    acceptable trade-off for cleaner code and error-free
    prediction/importance reporting.
    
    Closes #301
    
    Co-authored-by: Nandish Jayaram <[email protected]>

----

