[jira] [Comment Edited] (MADLIB-965) RF and DT should accept array input for feature vector

Frank McQuillan (JIRA) Thu, 18 May 2017 11:10:27 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016151#comment-16016151
 ]


Frank McQuillan edited comment on MADLIB-965 at 5/18/17 6:09 PM:
-----------------------------------------------------------------

Running an example similar to the one in the description above:
{code}
select * from train_output_summary;
{code}
produces
{code}
-[ RECORD 1 ]---------+------------------------------------------------------
method                | forest_train
is_classification     | t
source_table          | dt_golf
model_table           | train_output
id_col_name           | id
dependent_varname     | class
independent_varnames  | "OUTLOOK",windy,"Cont_features"[1],"Cont_features"[2]
cat_features          | "OUTLOOK",windy
con_features          | "Cont_features"[1],"Cont_features"[2]
grouping_cols         | 
num_trees             | 20
num_random_features   | 2
max_tree_depth        | 8
min_split             | 3
min_bucket            | 1
num_splits            | 10
verbose               | f
importance            | t
num_permutations      | 1
num_all_groups        | 1
num_failed_groups     | 0
total_rows_processed  | 14
total_rows_skipped    | 0
dependent_var_levels  | "'Don't Play'","'Play'"
dependent_var_type    | text
independent_var_types | text, boolean, double precision, double precision
{code}
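For reference, a setup along the following lines could plausibly lead to a summary like the one above. This is only a sketch and has not been run against this exact configuration. It assumes the dt_golf table from the issue description. The table name dt_golf_mixed and the derivation of its columns are hypothetical, chosen so that two categorical columns ("OUTLOOK", windy) sit alongside one array column ("Cont_features"). The forest_train arguments follow the call shown in the description, with the number of random features set to 2 to match the summary.

```sql
-- Hypothetical table: pack the two continuous features into a single
-- array column and keep the categorical features as ordinary columns.
DROP TABLE IF EXISTS dt_golf_mixed;
CREATE TABLE dt_golf_mixed AS
    SELECT id,
           "OUTLOOK",
           (windy = 'true') AS windy,                        -- text -> boolean
           array[temperature, humidity] AS "Cont_features",  -- double precision[]
           class
    FROM dt_golf;

DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('dt_golf_mixed',     -- source table (hypothetical name)
                           'train_output',      -- output model table
                           'id',                -- id column
                           'class',             -- response
                           '"OUTLOOK", windy, "Cont_features"',  -- features, incl. array column
                           NULL,                -- exclude columns
                           NULL,                -- grouping columns
                           20::integer,         -- number of trees
                           2::integer,          -- number of random features
                           TRUE::boolean,       -- variable importance
                           1::integer,          -- num_permutations
                           8::integer,          -- max depth
                           3::integer,          -- min split
                           1::integer,          -- min bucket
                           10::integer          -- number of splits per continuous variable
                           );

SELECT * FROM train_output_summary;
```

In the summary above, each element of the array column is reported as its own continuous feature ("Cont_features"[1], "Cont_features"[2]), while "OUTLOOK" and windy are listed under cat_features.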


was (Author: fmcquillan):
Running the example from above works now:
{code}
select * from train_output_summary;
{code}
produces
{code}
-[ RECORD 1 ]---------+-----------------------------------
method                | forest_train
is_classification     | t
source_table          | dt_golf_array
model_table           | train_output
id_col_name           | id
dependent_varname     | class
independent_varnames  | input_array[1],input_array[2]
cat_features          | 
con_features          | input_array[1],input_array[2]
grouping_cols         | 
num_trees             | 20
num_random_features   | 1
max_tree_depth        | 8
min_split             | 3
min_bucket            | 1
num_splits            | 10
verbose               | f
importance            | t
num_permutations      | 1
num_all_groups        | 1
num_failed_groups     | 0
total_rows_processed  | 14
total_rows_skipped    | 0
dependent_var_levels  | "Don't Play","Play"
dependent_var_type    | text
independent_var_types | double precision, double precision
{code}

> RF and DT should accept array input for feature vector
> ------------------------------------------------------
>
>                 Key: MADLIB-965
>                 URL: https://issues.apache.org/jira/browse/MADLIB-965
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Decision Tree, Module: Random Forest
>            Reporter: Rashmi Raghu
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>         Attachments: DT and RF work1.ipynb
>
>
> We were trying to test whether the RF module could handle a column containing
> an array of features as input (instead of each feature in a separate column).
> The result was an error message, but the message does not make clear the
> source of the error (i.e., whether it is caused by the array feature input
> column or something else). The example table, query, and error can be found below:
> {quote}
> -- Executing query:
> DROP TABLE IF EXISTS dt_golf;
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     windy text,
>     class text
> ) ;
> -- Executing query:
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, 'false', 'Don''t Play'),
> (2, 'sunny', 80, 90, 'true', 'Don''t Play'),
> (3, 'overcast', 83, 78, 'false', 'Play'),
> (4, 'rain', 70, 96, 'false', 'Play'),
> (5, 'rain', 68, 80, 'false', 'Play'),
> (6, 'rain', 65, 70, 'true', 'Don''t Play'),
> (7, 'overcast', 64, 65, 'true', 'Play'),
> (8, 'sunny', 72, 95, 'false', 'Don''t Play'),
> (9, 'sunny', 69, 70, 'false', 'Play'),
> (10, 'rain', 75, 80, 'false', 'Play'),
> (11, 'sunny', 75, 70, 'true', 'Play'),
> (12, 'overcast', 72, 90, 'true', 'Play'),
> (13, 'overcast', 81, 75, 'false', 'Play'),
> (14, 'rain', 71, 80, 'true', 'Don''t Play');
> DROP TABLE IF EXISTS dt_golf_array;
> CREATE TABLE dt_golf_array as 
>     select id, array[temperature, humidity] as input_array, class
>     from dt_golf
> distributed by (id);
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('dt_golf_array',         -- source table
>                            'train_output',    -- output model table
>                            'id',              -- id column
>                            'class',           -- response
>                            'input_array',   -- features
>                            NULL,              -- exclude columns
>                            NULL,              -- grouping columns
>                            20::integer,       -- number of trees
>                            1::integer,        -- number of random features
>                            TRUE::boolean,     -- variable importance
>                            1::integer,        -- num_permutations
>                            8::integer,        -- max depth
>                            3::integer,        -- min split
>                            1::integer,        -- min bucket
>                            10::integer        -- number of splits per continuous variable
>                            );
> NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 
> 'id' as the Greenplum Database data distribution key for this table.
> HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make 
> sure column(s) chosen are the optimal data distribution key to minimize skew.
> query result with 1 row discarded.
> ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> ********** Error **********
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> SQL state: XX000
> Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> Context: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
