[
https://issues.apache.org/jira/browse/MADLIB-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954136#comment-15954136
]
Paul Chang commented on MADLIB-1087:
------------------------------------
Here is a script I used to reproduce the problem:
DROP TABLE IF EXISTS public.paul_badrftest2;
CREATE TABLE public.paul_badrftest2 (
id INT NOT NULL PRIMARY KEY
, resp INTEGER NOT NULL
, feat NUMERIC(38,19) NOT NULL
);
INSERT INTO public.paul_badrftest2 (id, resp, feat)
VALUES (0, 0, 0), (1, 1, 1), (2, 2, 2), (3, 3, 3), (4, 4, 4),
       (5, 5, 5), (6, 6, 6), (7, 7, 7), (8, 8, 8), (9, 9, 9);
DROP TABLE IF EXISTS public.paul_badrftest2_train,
public.paul_badrftest2_train_group, public.paul_badrftest2_train_summary;
SELECT "madlib"."forest_train"(
'"public"."paul_badrftest2"', -- source
'"public"."paul_badrftest2_train"', -- destination
'"id"', -- id
'"resp"', -- response
'"feat"', -- features
CAST(NULL AS text), -- exclude
CAST(NULL AS text), -- grouping
1, -- number of trees
1, -- random features
TRUE, -- variable importance
1, -- random permutations
5, -- max depth
3, -- min split
3, -- min bucket
10, -- splits per continuous variable
CAST(NULL AS text), -- surrogates
FALSE, -- verbose
1.0); -- sampling
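Since the failure only occurs when every feature column is INT or NUMERIC, one possible workaround (a sketch I have not verified against this MADlib build; the staging table name is my own) is to expose the NUMERIC feature as DOUBLE PRECISION before calling forest_train():

```sql
-- Hypothetical workaround: materialize the features with the NUMERIC
-- column cast to DOUBLE PRECISION so the trainer sees a float feature.
DROP TABLE IF EXISTS public.paul_badrftest2_float;
CREATE TABLE public.paul_badrftest2_float AS
SELECT id
     , resp
     , CAST(feat AS DOUBLE PRECISION) AS feat
FROM public.paul_badrftest2;
-- Then pass '"public"."paul_badrftest2_float"' as the source argument
-- to forest_train() in place of '"public"."paul_badrftest2"'.
```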
> Random Forest fails if features are INT or NUMERIC only and variable
> importance is TRUE
> ---------------------------------------------------------------------------------------
>
> Key: MADLIB-1087
> URL: https://issues.apache.org/jira/browse/MADLIB-1087
> Project: Apache MADlib
> Issue Type: Bug
> Components: Module: Random Forest
> Reporter: Paul Chang
>
> If we attempt to train on a dataset where all features are either INT or
> NUMERIC, and variable importance is set to TRUE, forest_train() fails with
> the following error:
> [2017-04-03 13:35:35] [XX000] ERROR: plpy.SPIError: invalid array length
> (plpython.c:4648)
> [2017-04-03 13:35:35] Detail: array_of_bigint: Size should be in [1, 1e7], 0
> given
> [2017-04-03 13:35:35] Where: Traceback (most recent call last):
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 42, in <module>
> [2017-04-03 13:35:35] sample_ratio
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 591, in
> forest_train
> [2017-04-03 13:35:35] PL/Python function "forest_train", line 1038, in
> _calculate_oob_prediction
> [2017-04-03 13:35:35] PL/Python function "forest_train"
> However, if we add a single feature column that is FLOAT, REAL, or DOUBLE
> PRECISION, the trainer does not fail.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)