[
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016151#comment-16016151
]
Frank McQuillan edited comment on MADLIB-965 at 5/18/17 6:09 PM:
-----------------------------------------------------------------
Running an example similar to the one in the description above:
{code}
select * from train_output_summary;
{code}
produces
{code}
-[ RECORD 1 ]---------+------------------------------------------------------
method | forest_train
is_classification | t
source_table | dt_golf
model_table | train_output
id_col_name | id
dependent_varname | class
independent_varnames | "OUTLOOK",windy,"Cont_features"[1],"Cont_features"[2]
cat_features | "OUTLOOK",windy
con_features | "Cont_features"[1],"Cont_features"[2]
grouping_cols |
num_trees | 20
num_random_features | 2
max_tree_depth | 8
min_split | 3
min_bucket | 1
num_splits | 10
verbose | f
importance | t
num_permutations | 1
num_all_groups | 1
num_failed_groups | 0
total_rows_processed | 14
total_rows_skipped | 0
dependent_var_levels | "'Don't Play'","'Play'"
dependent_var_type | text
independent_var_types | text, boolean, double precision, double precision
{code}
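The summary above lists each element of the array column as its own feature ("Cont_features"[1], "Cont_features"[2]) and splits the features into categorical and continuous. As a rough illustration only, the Python sketch below reproduces that naming. The helper expand_features, its tuple input format, and the rule that text and boolean columns count as categorical are assumptions made for this sketch; this is not MADlib's actual code.

```python
# Hypothetical helper, NOT part of MADlib: expands array-typed columns into one
# indexed feature per element, mirroring the names shown in the summary table.

# Assumption for this sketch: text and boolean columns are categorical,
# everything else is continuous.
CATEGORICAL_TYPES = {"text", "boolean"}


def expand_features(columns):
    """columns: list of (name, sql_type, array_length); array_length is None for scalars."""
    names, types = [], []
    for name, sql_type, length in columns:
        if length is None:
            # Scalar column: used as a single feature.
            names.append(name)
            types.append(sql_type)
        else:
            # Array column: one feature per element; SQL arrays are 1-indexed.
            for i in range(1, length + 1):
                names.append("%s[%d]" % (name, i))
                types.append(sql_type)
    cat = [n for n, t in zip(names, types) if t in CATEGORICAL_TYPES]
    con = [n for n, t in zip(names, types) if t not in CATEGORICAL_TYPES]
    return names, types, cat, con


# Columns modeled on the summary output above.
columns = [
    ('"OUTLOOK"', "text", None),
    ("windy", "boolean", None),
    ('"Cont_features"', "double precision", 2),
]
names, types, cat, con = expand_features(columns)
print("independent_varnames |", ",".join(names))
print("cat_features         |", ",".join(cat))
print("con_features         |", ",".join(con))
print("independent_var_types|", ", ".join(types))
```

Under these assumptions, a two-element array column contributes two continuous features, which matches the independent_var_types row listing double precision twice.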
was (Author: fmcquillan):
Running the example from above works now:
{code}
select * from train_output_summary;
{code}
produces
{code}
-[ RECORD 1 ]---------+-----------------------------------
method | forest_train
is_classification | t
source_table | dt_golf_array
model_table | train_output
id_col_name | id
dependent_varname | class
independent_varnames | input_array[1],input_array[2]
cat_features |
con_features | input_array[1],input_array[2]
grouping_cols |
num_trees | 20
num_random_features | 1
max_tree_depth | 8
min_split | 3
min_bucket | 1
num_splits | 10
verbose | f
importance | t
num_permutations | 1
num_all_groups | 1
num_failed_groups | 0
total_rows_processed | 14
total_rows_skipped | 0
dependent_var_levels | "Don't Play","Play"
dependent_var_type | text
independent_var_types | double precision, double precision
{code}
> RF and DT should accept array input for feature vector
> ------------------------------------------------------
>
> Key: MADLIB-965
> URL: https://issues.apache.org/jira/browse/MADLIB-965
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Decision Tree, Module: Random Forest
> Reporter: Rashmi Raghu
> Assignee: Rahul Iyer
> Priority: Minor
> Fix For: v1.12
>
> Attachments: DT and RF work1.ipynb
>
>
> We were trying to test whether the RF module could handle a column containing
> an array of features as input (instead of each feature in a separate column).
> The result was an error message, but that message is unclear as to the source
> of the error (i.e. is it because of the array feature input column or
> something else). The example table, query and error can be found below:
> {quote}
> -- Executing query:
> DROP TABLE IF EXISTS dt_golf;
> CREATE TABLE dt_golf (
> id integer NOT NULL,
> "OUTLOOK" text,
> temperature double precision,
> humidity double precision,
> windy text,
> class text
> ) ;
> -- Executing query:
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, 'false', 'Don''t Play'),
> (2, 'sunny', 80, 90, 'true', 'Don''t Play'),
> (3, 'overcast', 83, 78, 'false', 'Play'),
> (4, 'rain', 70, 96, 'false', 'Play'),
> (5, 'rain', 68, 80, 'false', 'Play'),
> (6, 'rain', 65, 70, 'true', 'Don''t Play'),
> (7, 'overcast', 64, 65, 'true', 'Play'),
> (8, 'sunny', 72, 95, 'false', 'Don''t Play'),
> (9, 'sunny', 69, 70, 'false', 'Play'),
> (10, 'rain', 75, 80, 'false', 'Play'),
> (11, 'sunny', 75, 70, 'true', 'Play'),
> (12, 'overcast', 72, 90, 'true', 'Play'),
> (13, 'overcast', 81, 75, 'false', 'Play'),
> (14, 'rain', 71, 80, 'true', 'Don''t Play');
> DROP TABLE IF EXISTS dt_golf_array;
> CREATE TABLE dt_golf_array as
> select id, array[temperature, humidity] as input_array, class
> from dt_golf
> distributed by (id);
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('dt_golf_array', -- source table
> 'train_output', -- output model table
> 'id', -- id column
> 'class', -- response
> 'input_array', -- features
> NULL, -- exclude columns
> NULL, -- grouping columns
> 20::integer, -- number of trees
> 1::integer, -- number of random features
> TRUE::boolean, -- variable importance
> 1::integer, -- num_permutations
> 8::integer, -- max depth
> 3::integer, -- min split
> 1::integer, -- min bucket
> 10::integer -- number of splits per continuous variable
> );
> NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named
> 'id' as the Greenplum Database data distribution key for this table.
> HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make
> sure column(s) chosen are the optimal data distribution key to minimize skew.
> query result with 1 row discarded.
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL: array_of_bigint: Size should be in [1, 1e7], 0 given
> CONTEXT: Traceback (most recent call last):
> PL/Python function "forest_train", line 42, in <module>
> sample_ratio
> PL/Python function "forest_train", line 589, in forest_train
> PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> ********** Error **********
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> SQL state: XX000
> Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> Context: Traceback (most recent call last):
> PL/Python function "forest_train", line 42, in <module>
> sample_ratio
> PL/Python function "forest_train", line 589, in forest_train
> PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {quote}