Hi Rahul,

Thank you for your comment. It seems I need to investigate the continuous features further to find out what the issue is.
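
To start that investigation, I am thinking of running something like the two queries below. This is only a rough sketch under assumptions I have not confirmed: that the input table is test_rf_data (the name in my forest_train call), that the numeric columns are the ones meant to be continuous, and that NULL values could be what leaves a feature array empty for some rows.

```sql
-- 1) Declared type of every column in the input table.
--    Assumption: the table is test_rf_data, as in my forest_train call.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'test_rf_data'
ORDER BY ordinal_position;

-- 2) Per-linkid count of NULLs in the columns I believe are continuous.
--    Assumption: NULLs may be related to the empty array; this is a guess.
SELECT linkid,
       count(*)                      AS n_rows,
       count(*) - count(ave_temp)    AS null_ave_temp,
       count(*) - count(ave_wind)    AS null_ave_wind,
       count(*) - count(precip)      AS null_precip,
       count(*) - count(radiation)   AS null_radiation,
       count(*) - count(ave_speed)   AS null_ave_speed,
       count(*) - count(travel_time) AS null_travel_time
FROM test_rf_data
GROUP BY linkid
-- Keep only the groups where at least one of these columns has a NULL.
HAVING count(*) > least(count(ave_temp), count(ave_wind), count(precip),
                        count(radiation), count(ave_speed), count(travel_time))
ORDER BY linkid;
```

If the second query returns no rows, NULLs in these columns are probably not the cause and I will need to look elsewhere.
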
Based on your comment, I understand that madlib.forest_train() separates the continuous features from the categorical features, but are there any rules for how the function decides which is which? When I look at cat_features in the output_summary table, I see that some continuous features are recognized as categorical. Is there any way to manually specify which features are continuous and which are categorical?

Thank you,
Tetsuo

2015-12-01 4:09 GMT+09:00 Rahul Iyer <[email protected]>:

> Hi Tetsuo,
>
> I don't think it's the 'id' that is causing this issue, rather the array of
> features. Decision tree combines the continuous and categorical features in
> two separate arrays - one of those (most probably the continuous feature)
> is empty for a particular tuple. I can't comment more without looking at
> the dataset.
>
> Within the array operations module, we're returning the message as
> "array_of_bigint" for a float array. That's a minor messaging bug; I'll fix
> that as part of the next commit.
>
> Best,
> Rahul
>
> On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <[email protected]>
> wrote:
>
> > Hi,
> >
> > I am currently having an error with the MADlib Random Forest function in
> > MADlib 1.8.0. Below is the code I tried.
> >
> > DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
> > SELECT madlib.forest_train('test_rf_data',  -- input table name
> >                            'rf_output',     -- output table name
> >                            'id',            -- id column
> >                            'duration',      -- dependent variable
> >                            '*',             -- list of features
> >                            NULL,            -- exclude columns
> >                            'linkid',        -- grouping column
> >                            2::integer,      -- # of trees
> >                            5::integer,      -- # of random features
> >                            TRUE::boolean,   -- importance
> >                            1,               -- # of permutations
> >                            5,               -- max_tree_depth
> >                            10,              -- min_split
> >                            3,               -- min_bucket
> >                            10               -- number of splits per continuous variable
> >                            );
> >
> > When I tried this with all linkid (the grouping column with 362 linkids),
> > I got an error as in "error_random_forest.txt" attached here. The error
> > message says I have an invalid array length but does not give any
> > details about which features in the data have this issue.
> >
> > ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
> > DETAIL: array_of_bigint: Size should be in [1, 1e7], 0 given
> >
> > I guessed this is the error for the bigint columns but the only bigint
> > column is the "id" column. I once had an error that some features have
> > identical values in all records, but it is not the case this time because I
> > changed the sample size for each linkid as 1000 or above.
> > It seems something is zero from the DETAIL saying "0 given" but I have no
> > idea what in the data this is referring to.
> >
> >
> > The schema of the input table is as below;
> > CREATE TABLE input_table (
> >     id bigint,
> >     linkid varchar(32),
> >     duration double precision,
> >     sat_flg int,
> >     sun_flg int,
> >     holiday_flg int,
> >     semi_holiday_flg int,
> >     renkyu_flg int,
> >     ave_temp numeric,
> >     ave_wind numeric,
> >     precip numeric,
> >     radiation numeric,
> >     ave_speed numeric,
> >     travel_time numeric
> > );
> >
> > Can anybody please let me know the possible cause of this error? The
> > MADlib linear regression worked without any problems.
> >
> > I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
> >
> >
> > Thank you,
> >
> > Tetsuo
> >

--
----------------------------------------
Pivotalジャパン株式会社
小林哲郎 (Tetsuo Kobayashi)
Senior Data Scientist
E-mail: [email protected]
TEL: 080-9979-0757 (mobile)
----------------------------------------
