[jira] [Commented] (MADLIB-925) Improve RF output format for variable importance (and new DT/RF impurity importance)

ASF GitHub Bot (JIRA) Wed, 18 Jul 2018 14:14:25 -0700


    [ https://issues.apache.org/jira/browse/MADLIB-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548406#comment-16548406 ]


ASF GitHub Bot commented on MADLIB-925:
---------------------------------------

GitHub user njayaram2 opened a pull request:

    https://github.com/apache/madlib/pull/295

    Recursive Partitioning: Add function to report importance scores

    JIRA: MADLIB-925
    
    This commit adds a new MADlib function (get_var_importance) that reports
    variable importance scores for decision tree (DT) and random forest (RF)
    models. RF models trained before MADlib 1.15 report only the out-of-bag
    (oob) variable importance; from 1.15 onwards they also report impurity
    variable importance. The function returns both scores for >=1.15 RF
    models and only the oob score for <1.15 RF models.
    When called on a >=1.15 DT model, the function returns the impurity
    variable importance score.
    
    Co-authored-by: Jingyi Mei <[email protected]>
    Co-authored-by: Orhan Kislal <[email protected]>
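
As a rough sketch of how the new function might be called: the message above names only get_var_importance, so the two-argument form (model table, then output table), the `madlib` schema prefix, and the table names `rf_output` and `rf_importance` are assumptions made for illustration.

```sql
-- Assumption: get_var_importance(model_table, output_table).
-- 'rf_output' stands for an existing DT or RF model table;
-- 'rf_importance' is a hypothetical name for the table to create.
DROP TABLE IF EXISTS rf_importance;

SELECT madlib.get_var_importance('rf_output', 'rf_importance');

-- Inspect the result. Per the commit message, a >=1.15 RF model should
-- yield both oob and impurity importance, a <1.15 RF model only the oob
-- score, and a >=1.15 DT model only the impurity score.
SELECT * FROM rf_importance;
```

The exact column names of the output table are not given in this message, so the sketch selects all columns rather than naming them.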

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib feature/output-importance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #295
    
----
commit 54a4a17915f6ce1ddea6260db2d06fcd0ee50f51
Author: Nandish Jayaram <njayaram@...>
Date:   2018-07-03T19:22:07Z

    Recursive Partitioning: Add function to report importance scores
    
    JIRA: MADLIB-925
    
    This commit adds a new MADlib function (get_var_importance) that reports
    variable importance scores for decision tree (DT) and random forest (RF)
    models. RF models trained before MADlib 1.15 report only the out-of-bag
    (oob) variable importance; from 1.15 onwards they also report impurity
    variable importance. The function returns both scores for >=1.15 RF
    models and only the oob score for <1.15 RF models.
    When called on a >=1.15 DT model, the function returns the impurity
    variable importance score.
    
    Co-authored-by: Jingyi Mei <[email protected]>
    Co-authored-by: Orhan Kislal <[email protected]>

----


> Improve RF output format for variable importance (and new DT/RF impurity 
> importance)
> -------------------------------------------------------------------------------------
>
>                 Key: MADLIB-925
>                 URL: https://issues.apache.org/jira/browse/MADLIB-925
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: Module: Random Forest
>            Reporter: Frank McQuillan
>            Assignee: Nandish Jayaram
>            Priority: Major
>              Labels: starter
>             Fix For: v1.15
>
>
> As a user,
> I want an easier way to access the variable importance output from 
> random forest so that I can see which variables are most important.
> The current way to get the importance of each variable in tabular form 
> (assuming the output table is named `rf_output`) is: 
> ```
> -- Pair each categorical feature name with its importance score.
> SELECT unnest(regexp_split_to_array(cat_features, ',')) AS variable,
>        unnest(cat_var_importance) AS importance
> FROM rf_output_group, rf_output_summary;
> ```
> This query is cumbersome to write and has to be written twice: once for 
> categorical features and once for continuous features.
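
To make the duplication concrete, here is a sketch of what covering both feature types might look like with the old approach. The column names `con_features` and `con_var_importance` are assumed by analogy with the categorical names in the query above, and the sketch assumes a model trained without grouping, so that `rf_output_group` holds a single row.

```sql
-- Categorical features: split the comma-separated name list and pair
-- each name with the matching entry of the importance array.
SELECT unnest(regexp_split_to_array(cat_features, ',')) AS variable,
       unnest(cat_var_importance) AS importance
FROM rf_output_group, rf_output_summary

UNION ALL

-- Continuous features: the same query repeated with the assumed
-- con_* column names.
SELECT unnest(regexp_split_to_array(con_features, ',')) AS variable,
       unnest(con_var_importance) AS importance
FROM rf_output_group, rf_output_summary;
```

The pairing relies on the name list and the importance array having the same length and order, which is one more reason a single purpose-built function is easier to use.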



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
