Hi Rahul,
This helps a lot. Thank you for your support.

Tetsuo

On Wednesday, December 2, 2015, Rahul Iyer <[email protected]> wrote:

> Hi Tetsuo,
>
> Random forest uses the decision tree module that builds the features. The DT
> doc page <http://doc.madlib.net/latest/group__grp__decision__tree.html>
> says: "... boolean, integer, and text columns are considered categorical
> and double precision columns are considered continuous".
>
> Casting your continuous features to double precision should force them to
> be used as continuous.
>
> Best,
> Rahul
>
>
> On Mon, Nov 30, 2015 at 5:46 PM, Tetsuo Kobayashi <[email protected]> wrote:
>
>> Hi Rahul,
>>
>> Thank you for your comment. It seems I need to investigate the continuous
>> features more to find out what the issue is.
>>
>> Based on your comment, I understand that madlib.forest_train() separates
>> the continuous features from the categorical features, but are there any
>> rules for how the function separates the two? Some continuous features
>> are recognized as categorical features when I look at cat_features in the
>> output_summary table.
>> Is there any way I can manually specify which features are continuous
>> and which are categorical?
>>
>> Thank you,
>>
>> Tetsuo
>>
>>
>> 2015-12-01 4:09 GMT+09:00 Rahul Iyer <[email protected]>:
>>
>>> Hi Tetsuo,
>>>
>>> I don't think it's the 'id' that is causing this issue, rather the array
>>> of features. Decision tree combines the continuous and categorical
>>> features in two separate arrays - one of those (most probably the
>>> continuous feature) is empty for a particular tuple. I can't comment more
>>> without looking at the dataset.
>>>
>>> Within the array operations module, we're returning the message as
>>> "array_of_bigint" for a float array. That's a minor messaging bug; I'll
>>> fix that as part of the next commit.
>>>
>>> Best,
>>> Rahul
>>>
>>> On Sun, Nov 29, 2015 at 12:41 AM, Tetsuo Kobayashi <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am currently getting an error with the MADlib Random Forest function
>>> > in MADlib 1.8.0. Below is the code I tried.
>>> >
>>> > DROP TABLE IF EXISTS rf_output, rf_output_group, rf_output_summary;
>>> > SELECT madlib.forest_train('test_rf_data',  -- input table name
>>> >                            'rf_output',     -- output table name
>>> >                            'id',            -- id column
>>> >                            'duration',      -- dependent variable
>>> >                            '*',             -- list of features
>>> >                            NULL,            -- exclude columns
>>> >                            'linkid',        -- grouping column
>>> >                            2::integer,      -- # of trees
>>> >                            5::integer,      -- # of random features
>>> >                            TRUE::boolean,   -- importance
>>> >                            1,               -- # of permutations
>>> >                            5,               -- max_tree_depth
>>> >                            10,              -- min_split
>>> >                            3,               -- min_bucket
>>> >                            10               -- number of splits per continuous variable
>>> >                            );
>>> >
>>> > When I ran this with all linkids (the grouping column has 362 linkids),
>>> > I got the error shown in the attached "error_random_forest.txt". The
>>> > error message says I have an invalid array length but gives no details
>>> > about which features in the data have this issue.
>>> >
>>> > ERROR: plpy.SPIError: invalid array length (plpython.c:4648)
>>> > DETAIL: array_of_bigint: Size should be in [1, 1e7], 0 given
>>> >
>>> > I guessed this was an error for the bigint columns, but the only bigint
>>> > column is the "id" column. I once had an error because some features had
>>> > identical values in all records, but that is not the case this time
>>> > because I changed the sample size for each linkid to 1000 or above.
>>> > From the DETAIL saying "0 given", it seems something is zero, but I have
>>> > no idea what in the data this is referring to.
>>> >
>>> >
>>> > The schema of the input table is as below:
>>> > CREATE TABLE input_table (
>>> >     id bigint,
>>> >     linkid varchar(32),
>>> >     duration double precision,
>>> >     sat_flg int,
>>> >     sun_flg int,
>>> >     holiday_flg int,
>>> >     semi_holiday_flg int,
>>> >     renkyu_flg int,
>>> >     ave_temp numeric,
>>> >     ave_wind numeric,
>>> >     precip numeric,
>>> >     radiation numeric,
>>> >     ave_speed numeric,
>>> >     travel_time numeric
>>> > );
>>> >
>>> > Can anybody please let me know the possible cause of this error? The
>>> > MADlib linear regression worked without any problems.
>>> >
>>> > I am using MADlib 1.8.0 on GPDB 4.3.6.1. The OS is CentOS.
>>> >
>>> >
>>> > Thank you,
>>> >
>>> > Tetsuo
>>> >
>>>
>>
>> --
>> ----------------------------------------
>> Pivotal Japan K.K.
>> 小林哲郎 (Tetsuo Kobayashi)
>> Senior Data Scientist
>> E-mail: [email protected]
>> TEL: 080-9979-0757 (mobile)
>> ----------------------------------------
>>
>

--
----------------------------------------
Pivotal Japan K.K.
小林哲郎 (Tetsuo Kobayashi)
Senior Data Scientist
E-mail: [email protected]
TEL: 080-9979-0757 (mobile)
----------------------------------------
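
For reference, the cast Rahul suggests could look roughly like the sketch below. It assumes that the table passed to forest_train ('test_rf_data') has the schema posted in the original message, and that the six numeric columns are the ones meant to be continuous. The table name test_rf_data_cast is hypothetical and not part of the thread.

```sql
-- Sketch only: test_rf_data_cast is a hypothetical name, and the choice of
-- which columns to cast is an assumption based on the posted schema.
DROP TABLE IF EXISTS test_rf_data_cast;
CREATE TABLE test_rf_data_cast AS
SELECT id,
       linkid,
       duration,                                   -- already double precision
       sat_flg,                                    -- int flags are left as-is
       sun_flg,
       holiday_flg,
       semi_holiday_flg,
       renkyu_flg,
       ave_temp::double precision    AS ave_temp,  -- numeric -> double precision
       ave_wind::double precision    AS ave_wind,
       precip::double precision      AS precip,
       radiation::double precision   AS radiation,
       ave_speed::double precision   AS ave_speed,
       travel_time::double precision AS travel_time
FROM test_rf_data;
```

Passing 'test_rf_data_cast' as the first argument of madlib.forest_train(), with the other arguments unchanged, should then let the cast columns be picked up as continuous features; the cat_features column of the output summary table is the place to confirm this.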
