[jira] [Commented] (MADLIB-1087) Random Forest fails if features are INT or NUMERIC only and variable importance is TRUE

Frank McQuillan (JIRA) Fri, 05 May 2017 14:27:36 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998979#comment-15998979
 ]


Frank McQuillan commented on MADLIB-1087:
-----------------------------------------

Here is an example I will check in a bit, which should be fixed with this story:

{code}
DROP TABLE IF EXISTS train_output, train_output_group, train_output_summary;
SELECT madlib.forest_train('dt_golf_array', -- source table
'train_output', -- output model table
'id', -- id column
'class', -- response
'input_array[1], input_array[2]', -- features
NULL, -- exclude columns
NULL, -- grouping columns
20::integer, -- number of trees
1::integer, -- number of random features
TRUE::boolean, -- variable importance
1::integer, -- num_permutations
8::integer, -- max depth
3::integer, -- min split
1::integer, -- min bucket
10::integer -- number of splits per continuous variable
);
{code}
It produces the same error:
{code}
DataError: (psycopg2.DataError) spiexceptions.InvalidParameterValue: invalid array length
DETAIL:  array_of_float: Size should be in [1, 1e7], 0 given
CONTEXT:  Traceback (most recent call last):
  PL/Python function "forest_train", line 39, in <module>
    sample_ratio
  PL/Python function "forest_train", line 602, in forest_train
  PL/Python function "forest_train", line 1049, in _calculate_oob_prediction
PL/Python function "forest_train"
 [SQL: "SELECT madlib.forest_train('dt_golf_array', -- source table\n'train_output', -- output model table\n'id', -- id column\n'class', -- response\n'input_array[1], input_array[2]', -- features\nNULL, -- exclude columns\nNULL, -- grouping columns\n20::integer, -- number of trees\n1::integer, -- number of random features\nTRUE::boolean, -- variable importance\n1::integer, -- num_permutations\n8::integer, -- max depth\n3::integer, -- min split\n1::integer, -- min bucket\n10::integer -- number of splits per continuous variable\n);"]
{code}

> Random Forest fails if features are INT or NUMERIC only and variable 
> importance is TRUE
> ---------------------------------------------------------------------------------------
>
>                 Key: MADLIB-1087
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1087
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Random Forest
>            Reporter: Paul Chang
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.12
>
>
> If we attempt to train on a dataset where all features are either INT or 
> NUMERIC, and with variable importance TRUE, forest_train() fails with the 
> following error:
> [2017-04-03 13:35:35] [XX000] ERROR: plpy.SPIError: invalid array length 
> (plpython.c:4648)
> [2017-04-03 13:35:35] Detail: array_of_bigint: Size should be in [1, 1e7], 0 
> given
> [2017-04-03 13:35:35] Where: Traceback (most recent call last):
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 42, in <module>
> [2017-04-03 13:35:35] sample_ratio
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 591, in 
> forest_train
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 1038, in 
> _calculate_oob_prediction
> [2017-04-03 13:35:35] PL/Python function "forest_train"
> However, if we add a single feature column that is FLOAT, REAL, or DOUBLE 
> PRECISION, the trainer does not fail.
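
The condition described in the report can be sketched as a minimal reproduction. This is only an illustration under stated assumptions: the table rf_int_only, its columns f1 and f2, its rows, and the output table name rf_int_only_out are all hypothetical, and the forest_train() arguments simply mirror the positional call in the comment above. It has not been run against an affected build, so whether it hits the same error is an expectation drawn from the description, not a verified result.

{code}
-- Hypothetical table: every feature column is INTEGER,
-- with no FLOAT, REAL, or DOUBLE PRECISION feature.
DROP TABLE IF EXISTS rf_int_only;
CREATE TABLE rf_int_only (
    id    INTEGER,
    class TEXT,
    f1    INTEGER,
    f2    INTEGER
);

-- Illustrative rows only; the values carry no meaning.
INSERT INTO rf_int_only VALUES
    (1, 'yes', 1, 10),
    (2, 'no',  2, 20),
    (3, 'yes', 1, 30),
    (4, 'no',  3, 10),
    (5, 'yes', 2, 20),
    (6, 'no',  3, 30);

-- Same argument order as the call in the comment above,
-- with variable importance set to TRUE.
DROP TABLE IF EXISTS rf_int_only_out, rf_int_only_out_group, rf_int_only_out_summary;
SELECT madlib.forest_train('rf_int_only', -- source table (hypothetical)
'rf_int_only_out', -- output model table (hypothetical)
'id', -- id column
'class', -- response
'f1, f2', -- features, INTEGER only
NULL, -- exclude columns
NULL, -- grouping columns
20::integer, -- number of trees
1::integer, -- number of random features
TRUE::boolean, -- variable importance
1::integer, -- num_permutations
8::integer, -- max depth
3::integer, -- min split
1::integer, -- min bucket
10::integer -- number of splits per continuous variable
);
{code}

According to the report, adding a single FLOAT, REAL, or DOUBLE PRECISION feature column to such a table avoids the failure, which makes that a convenient way to confirm that the feature types are the trigger.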



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
