[ 
https://issues.apache.org/jira/browse/MADLIB-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267898#comment-16267898
 ] 

Frank McQuillan commented on MADLIB-1173:
-----------------------------------------



This looks like it is working now.  Here’s a dummy case to test:

{code}
DROP TABLE IF EXISTS dt_golf;
CREATE TABLE dt_golf (
    id integer NOT NULL,
    outlook text,
    temperature double precision,
    humidity double precision,
    windy text,
    class text,
    array_vals double precision[]
) ;
INSERT INTO dt_golf (id,outlook,temperature,humidity,windy,class, array_vals) 
VALUES
(1, 'sunny', 85, 85, 'false', 'Don''t Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(2, 'sunny', 80, 90, 'true', 'Don''t Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(3, 'overcast', 83, 78, 'false', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(4, 'rain', 70, 96, 'false', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(5, 'rain', 68, 80, 'false', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(6, 'rain', 65, 70, 'true', 'Don''t Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(7, 'overcast', 64, 65, 'true', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(8, 'sunny', 72, 95, 'false', 'Don''t Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(9, 'sunny', 69, 70, 'false', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(10, 'rain', 75, 80, 'false', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(11, 'sunny', 75, 70, 'true', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(12, 'overcast', 72, 90, 'true', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(13, 'overcast', 81, 75, 'false', 'Play', ARRAY(SELECT * FROM 
generate_series(1,3000))),
(14, 'rain', 71, 80, 'true', 'Don''t Play', ARRAY(SELECT * FROM 
generate_series(1,3000)));
{code}
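
As a quick sanity check (not part of the original test case), the query below should confirm that every row carries an array well past the 1600 table column limit mentioned in the issue. It is only a sketch and assumes array_upper() is available on the database in use, as it is in PostgreSQL:

{code}
-- Each row was loaded with generate_series(1,3000), so with the 14 rows
-- inserted above both min_len and max_len should come back as 3000.
SELECT count(*)                        AS n_rows,
       min(array_upper(array_vals, 1)) AS min_len,
       max(array_upper(array_vals, 1)) AS max_len
FROM dt_golf;
{code}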

Train:

{code}
DROP TABLE IF EXISTS train_output, train_output_summary;
SELECT madlib.tree_train('dt_golf',         -- source table
                         'train_output',    -- output model table
                         'id',              -- id column
                         'class',           -- response
                         'outlook, temperature, windy, array_vals',   -- features
                         NULL::text,        -- exclude columns
                         'gini',            -- split criterion
                         NULL::text,        -- no grouping
                         NULL::text,        -- no weights
                         3,                 -- max depth
                         3,                 -- min split
                         1,                 -- min bucket
                         3                  -- number of bins per continuous variable
                         );
{code}

Predict:

{code}
DROP TABLE IF EXISTS prediction_results;
SELECT madlib.tree_predict('train_output',          -- tree model
                           'dt_golf',               -- new data table
                           'prediction_results',    -- output table
                           'response');             -- show prediction
SELECT g.id, class, estimated_class
FROM prediction_results p, dt_golf g
WHERE p.id = g.id
ORDER BY g.id;
{code}

produces:

{code}
 id |   class    | estimated_class 
----+------------+-----------------
  1 | Don't Play | Play
  2 | Don't Play | Play
  3 | Play       | Play
  4 | Play       | Play
  5 | Play       | Play
  6 | Don't Play | Play
  7 | Play       | Play
  8 | Don't Play | Play
  9 | Play       | Play
 10 | Play       | Play
 11 | Play       | Play
 12 | Play       | Play
 13 | Play       | Play
 14 | Don't Play | Play
(14 rows)
{code}

Before this fix, the error would have been:

{code}
InternalError: (psycopg2.InternalError) plpy.Error: Decision tree error: 
Missing columns in predict data table (dt_golf) that were used during training
CONTEXT:  Traceback (most recent call last):
  PL/Python function "tree_predict", line 19, in <module>
    return decision_tree.tree_predict(**globals())
  PL/Python function "tree_predict", line 1752, in tree_predict
  PL/Python function "tree_predict", line 75, in _assert
PL/Python function "tree_predict"
 [SQL: "SELECT madlib.tree_predict('train_output',          -- tree model\n     
                      'dt_golf',               -- new data table\n              
             'prediction_results',    -- output table\n                         
  'response');             -- show prediction"]
{code}




> DT predict fails with large feature vector arrays
> -------------------------------------------------
>
>                 Key: MADLIB-1173
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1173
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree
>            Reporter: Rahul Iyer
>             Fix For: v1.13
>
>
> Since decision trees can now take in arrays, it’s possible to train a model 
> with more than 1600 (table column limit) features.  For example, we can 
> assemble a feature_array field with 2000 elements, and pass that into 
> tree_train as the independent variable.  The model trains successfully and 
> produces the expected model and model_summary tables.
>  
> But then when trying to use that model table to make predictions using 
> tree_predict, it gets the following error:
>  
> {code}
> ERROR:  plpy.Error: Decision tree error: Missing columns in predict data 
> table (tbl_test_1_data_final) that were used during training (plpython.c:4656)
> CONTEXT:  Traceback (most recent call last):
>   PL/Python function "tree_predict", line 19, in <module>
>    return decision_tree.tree_predict(**globals())
>   PL/Python function "tree_predict", line 1752, in tree_predict
>   PL/Python function "tree_predict", line 75, in _assert
> PL/Python function "tree_predict"
>  {code}
> We think what’s happening (this is only a guess) is that the tree_train 
> function correctly makes use of the passed features, even if they number in 
> excess of 1600, but tree_predict is still for some reason limiting the 
> feature count to 1600.  So when it tries to run the predictions, it sees that 
> there are 2000 expected features according to the model table, but 
> tree_predict has limited the new_data_table to only 1600 features.  It then 
> sees a discrepancy between the 2000 expected by the model and the 1600 that 
> it has perceived on the new_data_table, and gets the above error.
>  
> We’ve observed the error only happens when over 1600 features are used.  If, 
> for example, we trained a model with 900 features, it would be able to predict 
> successfully.  If we train a model with 2000 features, it gets the above 
> error.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
