[
https://issues.apache.org/jira/browse/MADLIB-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954136#comment-15954136
]
Paul Chang commented on MADLIB-1087:
------------------------------------
Here is a script I used to reproduce the problem:
DROP TABLE IF EXISTS public.paul_badrftest2;
CREATE TABLE public.paul_badrftest2 (
id INT NOT NULL PRIMARY KEY
, resp INTEGER NOT NULL
, feat NUMERIC(38,19) NOT NULL
);
INSERT INTO public.paul_badrftest2 (id, resp, feat)
VALUES (0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4),
       (5, 5, 5), (6, 6, 6), (7, 7, 7), (8, 8, 8), (9, 9, 9);
DROP TABLE IF EXISTS public.paul_badrftest2_train,
public.paul_badrftest2_train_group, public.paul_badrftest2_train_summary;
SELECT "madlib"."forest_train"(
'"public"."paul_badrftest2"', -- source
'"public"."paul_badrftest2_train"', -- destination
'"id"', -- id
'"resp"', -- response
'"feat"', -- features
CAST(NULL AS text), -- exclude
CAST(NULL AS text), -- grouping
1, -- number of trees
1, -- random features
TRUE, -- variable importance
1, -- random permutations
5, -- max depth
3, -- min split
3, -- min bucket
10, -- splits per continuous variable
CAST(NULL AS text), -- surrogates
FALSE, -- verbose
1.0); -- sampling
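Since the failure only occurs when every feature column is INT or NUMERIC, one possible workaround (a sketch I have not verified against this MADlib build; the staging table name is my own) is to expose the NUMERIC feature as DOUBLE PRECISION before calling forest_train():

```sql
-- Hypothetical workaround: materialize the features with the NUMERIC
-- column cast to DOUBLE PRECISION so the trainer sees a float feature.
DROP TABLE IF EXISTS public.paul_badrftest2_float;
CREATE TABLE public.paul_badrftest2_float AS
SELECT id
     , resp
     , CAST(feat AS DOUBLE PRECISION) AS feat
FROM public.paul_badrftest2;
-- Then pass '"public"."paul_badrftest2_float"' as the source argument
-- to forest_train() in place of '"public"."paul_badrftest2"'.
```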
> Random Forest fails if features are INT or NUMERIC only and variable
> importance is TRUE
> ---------------------------------------------------------------------------------------
>
> Key: MADLIB-1087
> URL: https://issues.apache.org/jira/browse/MADLIB-1087
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Random Forest
> Reporter: Paul Chang
>
> If we attempt to train on a dataset where all features are either INT or
> NUMERIC, and variable importance is set to TRUE, forest_train() fails with
> the following error:
> [2017-04-03 13:35:35] [XX000] ERROR: plpy.SPIError: invalid array length
> (plpython.c:4648)
> [2017-04-03 13:35:35] Detail: array_of_bigint: Size should be in [1, 1e7], 0
> given
> [2017-04-03 13:35:35] Where: Traceback (most recent call last):
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 42, in <module>
> [2017-04-03 13:35:35] sample_ratio
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 591, in
> forest_train
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 1038, in
> _calculate_oob_prediction
> [2017-04-03 13:35:35] PL/Python function "forest_train"
> However, if we add a single feature column that is FLOAT, REAL, or DOUBLE
> PRECISION, the trainer does not fail.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)