[ https://issues.apache.org/jira/browse/MADLIB-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016072#comment-16016072 ]

Frank McQuillan edited comment on MADLIB-1087 at 5/18/17 4:59 PM:
------------------------------------------------------------------

Also testing array elements as features:

{code}
DROP TABLE IF EXISTS dt_golf;
CREATE TABLE dt_golf (
    id integer NOT NULL,
    outlook text,
    temperature double precision,
    humidity double precision,
    windy text,
    class text
) ;

INSERT INTO dt_golf (id,outlook,temperature,humidity,windy,class) VALUES
(1, 'sunny', 85, 85, 'false', 'Don''t Play'),
(2, 'sunny', 80, 90, 'true', 'Don''t Play'),
(3, 'overcast', 83, 78, 'false', 'Play'),
(4, 'rain', 70, 96, 'false', 'Play'),
(5, 'rain', 68, 80, 'false', 'Play'),
(6, 'rain', 65, 70, 'true', 'Don''t Play'),
(7, 'overcast', 64, 65, 'true', 'Play'),
(8, 'sunny', 72, 95, 'false', 'Don''t Play'),
(9, 'sunny', 69, 70, 'false', 'Play'),
(10, 'rain', 75, 80, 'false', 'Play'),
(11, 'sunny', 75, 70, 'true', 'Play'),
(12, 'overcast', 72, 90, 'true', 'Play'),
(13, 'overcast', 81, 75, 'false', 'Play'),
(14, 'rain', 71, 80, 'true', 'Don''t Play');
DROP TABLE IF EXISTS dt_golf_array;
CREATE TABLE dt_golf_array AS
SELECT id, ARRAY[temperature, humidity] AS input_array, class
FROM dt_golf;

SELECT * FROM dt_golf_array ORDER BY id;
{code}
produces
{code}
 id | input_array |   class    
----+-------------+------------
  1 | {85,85}     | Don't Play
  2 | {80,90}     | Don't Play
  3 | {83,78}     | Play
  4 | {70,96}     | Play
  5 | {68,80}     | Play
  6 | {65,70}     | Don't Play
  7 | {64,65}     | Play
  8 | {72,95}     | Don't Play
  9 | {69,70}     | Play
 10 | {75,80}     | Play
 11 | {75,70}     | Play
 12 | {72,90}     | Play
 13 | {81,75}     | Play
 14 | {71,80}     | Don't Play
(14 rows)
{code}
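A quick sanity check (illustrative, not part of the test itself): the feature strings passed to forest_train below are ordinary 1-based PostgreSQL array element references, so input_array[1] is temperature and input_array[2] is humidity.
{code}
-- PostgreSQL arrays are 1-based: input_array[1] = temperature, input_array[2] = humidity
SELECT id,
       input_array[1] AS temperature,
       input_array[2] AS humidity
FROM dt_golf_array
ORDER BY id;
{code}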
Running random forest training with the array elements as features:
{code}
DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('dt_golf_array', -- source table
'train_output', -- output model table
'id', -- id column
'class', -- response
'input_array[1], input_array[2]', -- features
NULL, -- exclude columns
NULL, -- grouping columns
20::integer, -- number of trees
1::integer, -- number of random features
TRUE::boolean, -- variable importance
1::integer, -- num_permutations
8::integer, -- max depth
3::integer, -- min split
1::integer, -- min bucket
10::integer -- number of splits per continuous variable
);
SELECT * FROM train_output_summary;
{code}
produces
{code}
-[ RECORD 1 ]---------+-----------------------------------
method                | forest_train
is_classification     | t
source_table          | dt_golf_array
model_table           | train_output
id_col_name           | id
dependent_varname     | class
independent_varnames  | input_array[1],input_array[2]
cat_features          | 
con_features          | input_array[1],input_array[2]
grouping_cols         | 
num_trees             | 20
num_random_features   | 1
max_tree_depth        | 8
min_split             | 3
min_bucket            | 1
num_splits            | 10
verbose               | f
importance            | t
num_permutations      | 1
num_all_groups        | 1
num_failed_groups     | 0
total_rows_processed  | 14
total_rows_skipped    | 0
dependent_var_levels  | "Don't Play","Play"
dependent_var_type    | text
independent_var_types | double precision, double precision
{code}
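A possible follow-up, sketched here rather than run: scoring the model back on the training table with madlib.forest_predict(). The output table name prediction_results is illustrative, and estimated_class assumes MADlib's usual estimated_<response> column naming.
{code}
-- Score the trained model on the same table (sketch; names are illustrative)
DROP TABLE IF EXISTS prediction_results;
SELECT madlib.forest_predict('train_output',       -- trained model table
                             'dt_golf_array',      -- table to score
                             'prediction_results', -- output table
                             'response');          -- return predicted class labels

-- Compare predictions against the actual labels
SELECT g.id, g.class, p.estimated_class
FROM prediction_results p
JOIN dt_golf_array g USING (id)
ORDER BY g.id;
{code}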



> Random Forest fails if features are INT or NUMERIC only and variable importance is TRUE
> ----------------------------------------------------------------------------------------
>
>                 Key: MADLIB-1087
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1087
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Paul Chang
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>
> If we attempt to train on a dataset where all features are either INT or NUMERIC, and with variable importance TRUE, forest_train() fails with the following error:
> [2017-04-03 13:35:35] [XX000] ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> [2017-04-03 13:35:35] Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> [2017-04-03 13:35:35] Where: Traceback (most recent call last):
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 42, in <module>
> [2017-04-03 13:35:35] sample_ratio
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 591, in forest_train
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 1038, in _calculate_oob_prediction
> [2017-04-03 13:35:35] PL/Python function "forest_train"
> However, if we add a single feature column that is FLOAT, REAL, or DOUBLE PRECISION, the trainer does not fail.
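
A minimal repro sketch of the condition described above, reusing the dt_golf data with the features cast to INT (table and column names here are illustrative, not taken from the original report):
{code}
-- All features INT and variable importance TRUE: per the report, forest_train()
-- fails until at least one FLOAT/REAL/DOUBLE PRECISION feature is added.
DROP TABLE IF EXISTS dt_golf_int;
CREATE TABLE dt_golf_int AS
SELECT id,
       temperature::integer AS temperature_int,
       humidity::integer    AS humidity_int,
       class
FROM dt_golf;

DROP TABLE IF EXISTS rf_int_output, rf_int_output_group, rf_int_output_summary;
SELECT madlib.forest_train('dt_golf_int',                   -- source table
                           'rf_int_output',                 -- output model table
                           'id',                            -- id column
                           'class',                         -- response
                           'temperature_int, humidity_int', -- INT-only features
                           NULL,                            -- exclude columns
                           NULL,                            -- grouping columns
                           20::integer,                     -- number of trees
                           1::integer,                      -- number of random features
                           TRUE::boolean,                   -- variable importance (the failing setting)
                           1::integer,                      -- num_permutations
                           8::integer,                      -- max depth
                           3::integer,                      -- min split
                           1::integer,                      -- min bucket
                           10::integer                      -- number of splits per continuous variable
);
{code}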


