[ 
https://issues.apache.org/jira/browse/MADLIB-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan resolved MADLIB-965.
------------------------------------
    Resolution: Fixed

> RF and DT should accept array input for feature vector
> ------------------------------------------------------
>
>                 Key: MADLIB-965
>                 URL: https://issues.apache.org/jira/browse/MADLIB-965
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Decision Tree, Module: Random Forest
>            Reporter: Rashmi Raghu
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>         Attachments: DT and RF work1.ipynb
>
>
> We were testing whether the RF module could handle a column containing an 
> array of features as input (instead of each feature in a separate column). 
> The result was an error, but the message does not make clear what caused it 
> (i.e. whether it is the array feature input column or something else). 
> The example table, query, and error are shown below:
> {quote}
> -- Executing query:
> DROP TABLE IF EXISTS dt_golf;
> CREATE TABLE dt_golf (
>     id integer NOT NULL,
>     "OUTLOOK" text,
>     temperature double precision,
>     humidity double precision,
>     windy text,
>     class text
> ) ;
> -- Executing query:
> INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
> (1, 'sunny', 85, 85, 'false', 'Don''t Play'),
> (2, 'sunny', 80, 90, 'true', 'Don''t Play'),
> (3, 'overcast', 83, 78, 'false', 'Play'),
> (4, 'rain', 70, 96, 'false', 'Play'),
> (5, 'rain', 68, 80, 'false', 'Play'),
> (6, 'rain', 65, 70, 'true', 'Don''t Play'),
> (7, 'overcast', 64, 65, 'true', 'Play'),
> (8, 'sunny', 72, 95, 'false', 'Don''t Play'),
> (9, 'sunny', 69, 70, 'false', 'Play'),
> (10, 'rain', 75, 80, 'false', 'Play'),
> (11, 'sunny', 75, 70, 'true', 'Play'),
> (12, 'overcast', 72, 90, 'true', 'Play'),
> (13, 'overcast', 81, 75, 'false', 'Play'),
> (14, 'rain', 71, 80, 'true', 'Don''t Play');
> DROP TABLE IF EXISTS dt_golf_array;
> CREATE TABLE dt_golf_array as 
>     select id, array[temperature, humidity] as input_array, class
>     from dt_golf
> distributed by (id);
> DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
> SELECT madlib.forest_train('dt_golf_array',         -- source table
>                            'train_output',    -- output model table
>                            'id',              -- id column
>                            'class',           -- response
>                            'input_array',   -- features
>                            NULL,              -- exclude columns
>                            NULL,              -- grouping columns
>                            20::integer,       -- number of trees
>                            1::integer,        -- number of random features
>                            TRUE::boolean,     -- variable importance
>                            1::integer,        -- num_permutations
>                            8::integer,        -- max depth
>                            3::integer,        -- min split
>                            1::integer,        -- min bucket
>                            10::integer        -- number of splits per continuous variable
>                            );
> NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 
> 'id' as the Greenplum Database data distribution key for this table.
> HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make 
> sure column(s) chosen are the optimal data distribution key to minimize skew.
> query result with 1 row discarded.
> ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> ********** Error **********
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> SQL state: XX000
> Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
> Context: Traceback (most recent call last):
>   PL/Python function "forest_train", line 42, in <module>
>     sample_ratio
>   PL/Python function "forest_train", line 589, in forest_train
>   PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
> PL/Python function "forest_train"
> {quote}
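>
> One way to check whether the array column itself triggers the error is to 
> expand the array back into scalar columns and rerun the same call. The 
> following is a minimal sketch, assuming input_array always holds exactly two 
> elements in the order [temperature, humidity]; the table names 
> dt_golf_expanded and train_output_expanded are hypothetical and do not appear 
> in the original report.
>
> ```sql
> -- Expand the two-element array into separate feature columns
> -- (PostgreSQL arrays built with array[...] are 1-indexed).
> DROP TABLE IF EXISTS dt_golf_expanded;
> CREATE TABLE dt_golf_expanded AS
>     SELECT id,
>            input_array[1] AS temperature,
>            input_array[2] AS humidity,
>            class
>     FROM dt_golf_array;
>
> -- Same arguments as the failing call, but with scalar feature columns.
> DROP TABLE IF EXISTS train_output_expanded, train_output_expanded_group, train_output_expanded_summary;
> SELECT madlib.forest_train('dt_golf_expanded',        -- source table
>                            'train_output_expanded',   -- output model table
>                            'id',                      -- id column
>                            'class',                   -- response
>                            'temperature, humidity',   -- features
>                            NULL,                      -- exclude columns
>                            NULL,                      -- grouping columns
>                            20::integer,               -- number of trees
>                            1::integer,                -- number of random features
>                            TRUE::boolean,             -- variable importance
>                            1::integer,                -- num_permutations
>                            8::integer,                -- max depth
>                            3::integer,                -- min split
>                            1::integer,                -- min bucket
>                            10::integer                -- number of splits per continuous variable
>                            );
> ```
>
> If this call succeeds while the array version fails, the array input column 
> is the likely source of the error.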



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)