[
https://issues.apache.org/jira/browse/MADLIB-925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548406#comment-16548406
]
ASF GitHub Bot commented on MADLIB-925:
---------------------------------------
GitHub user njayaram2 opened a pull request:
https://github.com/apache/madlib/pull/295
Recursive Partitioning: Add function to report importance scores
JIRA: MADLIB-925
This commit adds a new MADlib function (get_var_importance) to report the
importance scores of decision tree and random forest models. RF models
trained before MADlib 1.15 report only the oob variable importance score;
from 1.15 onwards they also report impurity variable importance. This
function reports both scores for >=1.15 RF models, and only the oob
variable importance score for <1.15 RF models.
When called on a DT model, this function returns the impurity variable
importance score for >=1.15 DT models.
Co-authored-by: Jingyi Mei <[email protected]>
Co-authored-by: Orhan Kislal <[email protected]>
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/madlib/madlib feature/output-importance
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/madlib/pull/295.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #295
----
commit 54a4a17915f6ce1ddea6260db2d06fcd0ee50f51
Author: Nandish Jayaram <njayaram@...>
Date: 2018-07-03T19:22:07Z
Recursive Partitioning: Add function to report importance scores
JIRA: MADLIB-925
----
> Improve RF output format for variable importance (and new DT/RF impurity
> importance)
> -------------------------------------------------------------------------------------
>
> Key: MADLIB-925
> URL: https://issues.apache.org/jira/browse/MADLIB-925
> Project: Apache MADlib
> Issue Type: Improvement
> Components: Module: Random Forest
> Reporter: Frank McQuillan
> Assignee: Nandish Jayaram
> Priority: Major
> Labels: starter
> Fix For: v1.15
>
>
> As a user,
> I want to have an easier way of accessing the variable importance output from
> random forest so that I can understand which are the most important variables.
> Current method of getting variable importance for each variable (in a tabular
> format - assuming output table name is `rf_output`):
> ```
> SELECT unnest(regexp_split_to_array(cat_features, ',')) AS variable,
>        unnest(cat_var_importance) AS importance
> FROM rf_output_group, rf_output_summary;
> ```
> This query is cumbersome to write and has to be written twice: once for
> categorical features and once for continuous features.
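
The pairing that the quoted query performs can be sketched outside the database to show why it needs two passes. The following is a minimal Python illustration, not MADlib code: the function names and sample values are hypothetical, and the continuous-feature inputs are assumed to mirror the categorical ones (`cat_features`, `cat_var_importance`) named in the issue. It does not show the signature of get_var_importance, which this message does not give.

```python
def pair_importance(features_csv, importance):
    """Pair a comma-separated feature list with its importance scores."""
    # Mirrors unnest(regexp_split_to_array(features, ',')) next to unnest(importance).
    names = [name.strip() for name in features_csv.split(",") if name.strip()]
    if len(names) != len(importance):
        raise ValueError("feature names and importance scores differ in length")
    return list(zip(names, importance))


def ranked_importance(cat_features, cat_importance, con_features, con_importance):
    """Combine categorical and continuous features into one ranked list."""
    # The same pairing runs twice, once per feature type, as in the issue.
    rows = pair_importance(cat_features, cat_importance)
    rows += pair_importance(con_features, con_importance)
    # Most important variable first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical values standing in for one row of the model output tables.
ranking = ranked_importance(
    "outlook,windy", [0.5, 0.25],
    "temperature,humidity", [0.75, 0.125],
)
for variable, importance in ranking:
    print(variable, importance)
```

A single function that returns one (variable, importance) row per feature removes the need to repeat this pairing for each feature type, which is the improvement the issue asks for.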