Rashmi Raghu created MADLIB-965:
-----------------------------------

             Summary: Error message unclear when running Random Forest 'forest_train' function
                 Key: MADLIB-965
                 URL: https://issues.apache.org/jira/browse/MADLIB-965
             Project: Apache MADlib
          Issue Type: Bug
          Components: Module: Random Forest
            Reporter: Rashmi Raghu


We were testing whether the RF module can handle a single column containing 
an array of features as input (instead of one column per feature). The call 
failed, but the error message does not make the source of the error clear 
(i.e., whether it is caused by the array-valued feature column or by 
something else). An example table, query, and the resulting error are shown 
below:
{quote}
-- Executing query:
DROP TABLE IF EXISTS dt_golf;
CREATE TABLE dt_golf (
    id integer NOT NULL,
    "OUTLOOK" text,
    temperature double precision,
    humidity double precision,
    windy text,
    class text
) ;

-- Executing query:
INSERT INTO dt_golf (id,"OUTLOOK",temperature,humidity,windy,class) VALUES
(1, 'sunny', 85, 85, 'false', 'Don''t Play'),
(2, 'sunny', 80, 90, 'true', 'Don''t Play'),
(3, 'overcast', 83, 78, 'false', 'Play'),
(4, 'rain', 70, 96, 'false', 'Play'),
(5, 'rain', 68, 80, 'false', 'Play'),
(6, 'rain', 65, 70, 'true', 'Don''t Play'),
(7, 'overcast', 64, 65, 'true', 'Play'),
(8, 'sunny', 72, 95, 'false', 'Don''t Play'),
(9, 'sunny', 69, 70, 'false', 'Play'),
(10, 'rain', 75, 80, 'false', 'Play'),
(11, 'sunny', 75, 70, 'true', 'Play'),
(12, 'overcast', 72, 90, 'true', 'Play'),
(13, 'overcast', 81, 75, 'false', 'Play'),
(14, 'rain', 71, 80, 'true', 'Don''t Play');

DROP TABLE IF EXISTS dt_golf_array;
CREATE TABLE dt_golf_array as 
    select id, array[temperature, humidity] as input_array, class
    from dt_golf
distributed by (id);

DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('dt_golf_array',         -- source table
                           'train_output',    -- output model table
                           'id',              -- id column
                           'class',           -- response
                           'input_array',   -- features
                           NULL,              -- exclude columns
                           NULL,              -- grouping columns
                           20::integer,       -- number of trees
                           1::integer,        -- number of random features
                           TRUE::boolean,     -- variable importance
                           1::integer,        -- num_permutations
                           8::integer,        -- max depth
                           3::integer,        -- min split
                           1::integer,        -- min bucket
                           10::integer        -- number of splits per continuous variable
                           );
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
query result with 1 row discarded.

ERROR:  plpy.SPIError: invalid array length (plpython.c:4648)
DETAIL:  array_of_bigint: Size should be in [1, 1e7], 0 given
CONTEXT:  Traceback (most recent call last):
  PL/Python function "forest_train", line 42, in <module>
    sample_ratio
  PL/Python function "forest_train", line 589, in forest_train
  PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
PL/Python function "forest_train"
********** Error **********

ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
SQL state: XX000
Detail: array_of_bigint: Size should be in [1, 1e7], 0 given
Context: Traceback (most recent call last):
  PL/Python function "forest_train", line 42, in <module>
    sample_ratio
  PL/Python function "forest_train", line 589, in forest_train
  PL/Python function "forest_train", line 1037, in _calculate_oob_prediction
PL/Python function "forest_train"
{quote}
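
To help isolate whether the array-valued column is the trigger, the same call 
can be repeated with each feature in its own column. The following is a 
minimal sketch only and was not run as part of this report: the table name 
dt_golf_unpacked and the output table name train_output_cols are 
hypothetical, and the forest_train arguments are copied from the failing call 
above with only the source table, output table, and feature list changed.
{quote}
-- Hypothetical comparison table: unpack the array back into scalar columns.
-- Arrays are 1-indexed, so input_array[1] is temperature and
-- input_array[2] is humidity (the order used when dt_golf_array was built).
DROP TABLE IF EXISTS dt_golf_unpacked;
CREATE TABLE dt_golf_unpacked as
    select id,
           input_array[1] as temperature,
           input_array[2] as humidity,
           class
    from dt_golf_array
distributed by (id);

-- Same arguments as the failing call, but with one column per feature.
DROP TABLE IF EXISTS train_output_cols, train_output_cols_group, train_output_cols_summary;
SELECT madlib.forest_train('dt_golf_unpacked',        -- source table
                           'train_output_cols',       -- output model table
                           'id',                      -- id column
                           'class',                   -- response
                           'temperature, humidity',   -- features
                           NULL,                      -- exclude columns
                           NULL,                      -- grouping columns
                           20::integer,               -- number of trees
                           1::integer,                -- number of random features
                           TRUE::boolean,             -- variable importance
                           1::integer,                -- num_permutations
                           8::integer,                -- max depth
                           3::integer,                -- min split
                           1::integer,                -- min bucket
                           10::integer                -- number of splits per continuous variable
                           );
{quote}
If this variant trains while the array variant fails, that would suggest the 
array input is what leads to the zero-length array reported from 
_calculate_oob_prediction, in which case the error message could state that 
directly.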



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
