[jira] [Commented] (MADLIB-1254) RF/DT: Grouping might give incorrect results if 1 group eliminates a categorical variable

ASF GitHub Bot (JIRA) Wed, 18 Jul 2018 16:26:19 -0700


    [ https://issues.apache.org/jira/browse/MADLIB-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16548567#comment-16548567 ]


ASF GitHub Bot commented on MADLIB-1254:
----------------------------------------

GitHub user iyerr3 opened a pull request:

    https://github.com/apache/madlib/pull/296

    DT/RF: Ensure cat features are recorded per group

    JIRA: MADLIB-1254
    
    If tree_train/forest_train is run with grouping enabled and one of the
    groups has a categorical feature with just a single level, then the
    categorical feature is eliminated for that group. If other groups retain
    that feature, we end up with an incorrect "bins" data structure built
    as part of DT.
    
    This commit fixes this issue by recording the categorical features
    present in each group separately.
    
    Closes #295
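
As a rough illustration of the idea (not the actual MADlib code), the per-group bookkeeping could be sketched in Python as below. The function name `cat_features_per_group`, the row layout, and the toy data are all hypothetical. The sketch only assumes that a categorical feature with fewer than two distinct non-NULL levels in a group is dropped for that group.

```python
from collections import defaultdict


def cat_features_per_group(rows, group_col, cat_cols):
    """Return {group value: [categorical columns kept for that group]}.

    Hypothetical helper, not part of MADlib: a column is kept for a group
    only if it has at least two distinct non-None levels in that group.
    """
    # levels[group][column] -> set of distinct non-None values seen
    levels = defaultdict(lambda: defaultdict(set))
    for row in rows:
        group_levels = levels[row[group_col]]  # registers the group even if all values are None
        for col in cat_cols:
            value = row[col]
            if value is not None:
                group_levels[col].add(value)

    # Preserve the caller's column order so positions stay comparable.
    return {
        group: [col for col in cat_cols if len(col_levels[col]) >= 2]
        for group, col_levels in levels.items()
    }


# Toy data (made up for this sketch): "color" has a single level in group "B".
rows = [
    {"grp": "A", "color": "red", "windy": True},
    {"grp": "A", "color": "blue", "windy": False},
    {"grp": "B", "color": "red", "windy": True},
    {"grp": "B", "color": "red", "windy": False},
]
kept = cat_features_per_group(rows, "grp", ["color", "windy"])
print(kept)  # {'A': ['color', 'windy'], 'B': ['windy']}
```

Keeping one feature list per group, rather than one list shared by all groups, is what lets any later per-feature structure be sized and indexed correctly for each group.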

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib bugfix/rf_grouping_cat_levels

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/madlib/pull/296.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #296
    
----
commit bf5fa81c264471729ef06ee4af8a27b41f22b45a
Author: Rahul Iyer <riyer@...>
Date:   2018-07-18T00:10:04Z

    DT/RF: Ensure cat features are recorded per group
    
    JIRA: MADLIB-1254
    
    If tree_train/forest_train is run with grouping enabled and one of the
    groups has a categorical feature with just a single level, then the
    categorical feature is eliminated for that group. If other groups retain
    that feature, we end up with an incorrect "bins" data structure built
    as part of DT.
    
    This commit fixes this issue by recording the categorical features
    present in each group separately.
    
    Closes #295

----


> RF/DT: Grouping might give incorrect results if 1 group eliminates a 
> categorical variable
> -----------------------------------------------------------------------------------------
>
>                 Key: MADLIB-1254
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1254
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree
>            Reporter: Rahul Iyer
>            Priority: Major
>             Fix For: v1.15
>
>
> If {{forest_train}} is run with grouping enabled and one of the groups has 
> a categorical feature with just a single level, then the categorical feature 
> is eliminated for that group. If other groups retain that feature, then the 
> output of {{impurity_var_importance}} is incorrect for the group in question. 
> There may be other ramifications as well. 
> {code:sql}
> DROP TABLE IF EXISTS dt_golf CASCADE;
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     "Cont_features" double precision[],
>     cat_features text[],
>     windy boolean,
>     class text
> ) ;
> INSERT INTO dt_golf
>     (id, "OUTLOOK", temperature, humidity, "Cont_features", cat_features, windy, class)
> VALUES
> (1, 'sunny', 85, 85, ARRAY[85, 85], ARRAY['a', 'b'], false, 'Don''t Play'),
> (2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['a', 'b'], true, 'Don''t Play'),
> (3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['a', 'b'], false, 'Play'),
> (4, 'rain', 70, NULL, ARRAY[70, 96], ARRAY['a', 'b'], false, 'Play'),
> (5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['a', 'b'], false, 'Play'),
> (6, 'rain', NULL, 70, ARRAY[65, 70], ARRAY['a', 'b'], true, 'Don''t Play'),
> (7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['c', 'b'], NULL , 'Play'),
> (8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['a', 'b'], false, 'Don''t Play'),
> (9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['a', 'b'], false, 'Play'),
> (10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['a', 'b'], false, 'Play'),
> (11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['a', 'd'], true, 'Play'),
> (12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['c', 'b'], NULL, 'Play'),
> (13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
> (15, NULL, 81, 75, ARRAY[81, 75], ARRAY['a', 'b'], false, 'Play'),
> (16, 'overcast', NULL, 75, ARRAY[81, 75], ARRAY['a', 'd'], false, 'Play'),
> (14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['c', 'b'], true, 'Don''t Play');
> DROP TABLE IF EXISTS train_output, train_output_summary, train_output_group, 
> train_output_poisson_count;
> SELECT forest_train(
>                   'dt_golf',         -- source table
>                   'train_output',    -- output model table
>                   'id',              -- id column
>                   'temperature::double precision',           -- response
>                   'humidity, cat_features, windy, "Cont_features"',   -- features
>                   NULL,        -- exclude columns
>                   'class',          -- grouping
>                   5,                -- num of trees
>                   NULL,                 -- num of random features
>                   TRUE,     -- importance
>                   20,         -- num_permutations
>                   10,       -- max depth
>                   1,        -- min split
>                   1,        -- min bucket
>                   3,        -- number of bins per continuous variable
>                   'max_surrogates = 2 ',  -- null handling params
>                   FALSE     -- verbose
>                   );
> \x on
> SELECT * from train_output_summary;
> SELECT * from train_output_group;
> {code}
> Results:
> {code:java}
> SELECT * from train_output_group;
> -[ RECORD 1 ]-----------+-----------------------------------------------------------------------------
> gid                     | 1
> class                   | Don't Play
> success                 | t
> cat_n_levels            | {2,2,2}
> cat_levels_in_text      | {c,a,True,False,c,a}
> oob_error               | 92.5335905349795
> oob_var_importance      | {10.725,10.725,10.725,7.605,10.725,0}
> impurity_var_importance | {8.33148348160485,0,0,19.9999998625892,19.9999998625892,11.6685163809844}
> -[ RECORD 2 ]-----------+-----------------------------------------------------------------------------
> gid                     | 2
> class                   | Play
> success                 | t
> cat_n_levels            | {2,2}
> cat_levels_in_text      | {b,d,False,True}
> oob_error               | 43.0244073645405
> oob_var_importance      | {1.06581410364015e-15,1.06581410364015e-15,2.1326171875,16.019375,10.570875}
> impurity_var_importance | {0,0,0,37.8304000437732,38.4881698525677,23.6814277291654}
> {code}
> Note that the {{impurity_var_importance}} for {{gid=2}} has length 6 while 
> the {{oob_var_importance}} correctly has 5.
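
The length mismatch noted above can be checked mechanically. The following Python sketch is only an illustration: the helper `importance_length_mismatches` is hypothetical (not a MADlib function), and it assumes each importance array should hold one entry per retained categorical feature (the length of `cat_n_levels`) plus one per continuous feature. The continuous count of 3 is also an assumption, taken from the feature list in the call above (`humidity` plus the two elements of `"Cont_features"`).

```python
def importance_length_mismatches(group_rows, n_con_features):
    """Return (gid, column, actual length, expected length) for each bad array.

    Hypothetical check: expected length is assumed to be
    len(cat_n_levels) + n_con_features.
    """
    problems = []
    for row in group_rows:
        expected = len(row["cat_n_levels"]) + n_con_features
        for name in ("oob_var_importance", "impurity_var_importance"):
            actual = len(row[name])
            if actual != expected:
                problems.append((row["gid"], name, actual, expected))
    return problems


# Values copied from the train_output_group results quoted above.
group_rows = [
    {
        "gid": 1,
        "cat_n_levels": [2, 2, 2],
        "oob_var_importance": [10.725, 10.725, 10.725, 7.605, 10.725, 0],
        "impurity_var_importance": [
            8.33148348160485, 0, 0,
            19.9999998625892, 19.9999998625892, 11.6685163809844,
        ],
    },
    {
        "gid": 2,
        "cat_n_levels": [2, 2],
        "oob_var_importance": [
            1.06581410364015e-15, 1.06581410364015e-15,
            2.1326171875, 16.019375, 10.570875,
        ],
        "impurity_var_importance": [
            0, 0, 0,
            37.8304000437732, 38.4881698525677, 23.6814277291654,
        ],
    },
]

mismatches = importance_length_mismatches(group_rows, n_con_features=3)
print(mismatches)  # [(2, 'impurity_var_importance', 6, 5)]
```

Under these assumptions only the `impurity_var_importance` array of `gid=2` is flagged, which matches the observation in the report.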



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
