Hi Tetsuo,

I don't think the 'id' column is causing this issue; the array of features is the more likely cause. The decision tree code combines the continuous and categorical features into two separate arrays, and one of those (most probably the continuous-feature array) is empty for a particular tuple. I can't comment further without looking at the dataset.
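
One way to start narrowing this down, offered only as a sketch: the query below counts, per linkid, the rows in which at least one feature column is NULL. It assumes the table name from the forest_train call (test_rf_data) and the column names from the schema in the quoted message. NULL features are only one guess at how a feature array could end up empty, so an empty result does not rule out other causes.

```sql
-- Sketch only: per-linkid count of rows with a NULL in any feature column.
-- Assumed names: test_rf_data (from the forest_train call) and the feature
-- columns listed in the posted schema.
SELECT linkid,
       count(*) AS rows_with_null_feature
FROM test_rf_data
WHERE sat_flg IS NULL
   OR sun_flg IS NULL
   OR holiday_flg IS NULL
   OR semi_holiday_flg IS NULL
   OR renkyu_flg IS NULL
   OR ave_temp IS NULL
   OR ave_wind IS NULL
   OR precip IS NULL
   OR radiation IS NULL
   OR ave_speed IS NULL
   OR travel_time IS NULL
GROUP BY linkid
ORDER BY linkid;
```

Any linkid that shows up in the result would be a reasonable first group to rerun forest_train on in isolation.
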
Within the array operations module, the error message reports "array_of_bigint" even when the array is a float array. That's a minor messaging bug; I'll fix it as part of the next commit.

Best,
Rahul

On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <[email protected]> wrote:
> Hi,
>
> I am currently having an error with the MADlib Random Forest function in
> MADlib 1.8.0. Below is the code I tried.
>
> DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
> SELECT madlib.forest_train('test_rf_data', -- input table name
>     'rf_output', -- output table name
>     'id', -- id column
>     'duration', -- dependent variable
>     '*', -- list of features
>     NULL, -- exclude columns
>     'linkid' -- grouping column
>     ,2::integer -- # of trees
>     ,5::integer, -- # of random features
>     TRUE::boolean, -- importance
>     1, -- # of permutations
>     5, -- max_tree_depth
>     10, -- min_split
>     3, -- min_bucket
>     10 -- number of splits per continuous variable
> );
>
> When I tried this with all linkid (the grouping column with 362 linkids),
> I got an error as in "error_random_forest.txt" attached here. The error
> message says I have an invalid array length but does not give any
> details about which features in the data have this issue.
>
> ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> DETAIL: array_of_bigint: Size should be in [1, 1e7], 0 given
>
> I guessed this is an error for the bigint columns, but the only bigint
> column is the "id" column. I once had an error because some features had
> identical values in all records, but that is not the case this time because I
> changed the sample size for each linkid to 1000 or above.
> It seems something is zero, from the DETAIL saying "0 given", but I have no
> idea what in the data this is referring to.
>
>
> The schema of the input table is as below:
> CREATE TABLE input_table (
>     id bigint,
>     linkid varchar(32),
>     duration double precision,
>     sat_flg int,
>     sun_flg int,
>     holiday_flg int,
>     semi_holiday_flg int,
>     renkyu_flg int,
>     ave_temp numeric,
>     ave_wind numeric,
>     precip numeric,
>     radiation numeric,
>     ave_speed numeric,
>     travel_time numeric
> );
>
> Can anybody please let me know what the possible cause of this error is? The
> MADlib linear regression worked without any problems.
>
> I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
>
>
> Thank you,
>
> Tetsuo
>
