[ 
https://issues.apache.org/jira/browse/MADLIB-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15987837#comment-15987837
 ] 

Frank McQuillan commented on MADLIB-1095:
-----------------------------------------

If I put a NULL in every feature vector, DT runs and processes all 14 rows, 
which seems fine.

{code:sql}
DROP TABLE IF EXISTS dt_golf_nulls;
CREATE TABLE dt_golf_nulls (
    id integer NOT NULL,
    outlook text,
    temperature double precision,
    humidity double precision,
    windy text,
    class text
) ;
INSERT INTO dt_golf_nulls (id,outlook,temperature,humidity,windy,class) VALUES
(1, NULL, 85, 85, 'false', 'Don''t Play'),
(2, NULL, 80, 90, 'true', 'Don''t Play'),
(3, 'overcast', 83, NULL, 'false', 'Play'),
(4, 'rain', 70, NULL, 'false', 'Play'),
(5, NULL, 68, 80, 'false', 'Play'),
(6, 'rain', NULL, 70, 'true', 'Don''t Play'),
(7, 'overcast', 64, NULL, 'true', 'Play'),
(8, 'sunny', 72, NULL, 'false', 'Don''t Play'),
(9, NULL, 69, 70, 'false', 'Play'),
(10, NULL, 75, 80, 'false', 'Play'),
(11, 'sunny', 75, 70, NULL, 'Play'),
(12, 'overcast', NULL, 90, 'true', 'Play'),
(13, 'overcast', NULL, 75, 'false', 'Play'),
(14, 'rain', 71, NULL, 'true', 'Don''t Play');
DROP TABLE IF EXISTS train_output, train_output_summary;
SELECT madlib.tree_train('dt_golf_nulls',   -- source table
                         'train_output',    -- output model table
                         'id',              -- id column
                         'class',           -- response
                         'outlook, temperature, humidity, windy',   -- features
                         NULL::text,        -- exclude columns
                         'gini',            -- split criterion
                         NULL::text,        -- no grouping
                         NULL::text,        -- no weights
                         5,                 -- max depth
                         3,                 -- min split
                         1,                 -- min bucket
                         6                  -- number of bins per continuous variable
                         );
SELECT * FROM train_output_summary;
{code}
produces
{code}
-[ RECORD 1 ]---------+-----------------------------------------------
method                | tree_train
is_classification     | t
source_table          | dt_golf_nulls
model_table           | train_output
id_col_name           | id
dependent_varname     | class
independent_varnames  | outlook, windy, temperature, humidity
cat_features          | outlook,windy
con_features          | temperature,humidity
grouping_cols         | 
num_all_groups        | 1
num_failed_groups     | 0
total_rows_processed  | 14
total_rows_skipped    | 0
dependent_var_levels  | "Don't Play","Play"
dependent_var_type    | text
input_cp              | 0
independent_var_types | text, text, double precision, double precision
{code}
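
As a quick cross-check that every row in the test table really does carry a 
NULL, the rows with at least one NULL feature can be counted directly. This is 
only a sketch: it assumes the dt_golf_nulls table defined above and uses plain 
PostgreSQL, nothing MADlib-specific.

{code:sql}
-- Count rows where at least one of the four feature columns is NULL.
-- Assumes the dt_golf_nulls table created earlier in this comment.
SELECT count(*) AS rows_with_null_feature
FROM dt_golf_nulls
WHERE outlook IS NULL
   OR temperature IS NULL
   OR humidity IS NULL
   OR windy IS NULL;
{code}

For the data above this returns 14, so none of the 14 rows reported in 
total_rows_processed has a fully populated feature vector.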

> Use populated parts of feature vector even if it contains one or more NULL 
> entries
> ----------------------------------------------------------------------------------
>
>                 Key: MADLIB-1095
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1095
>             Project: Apache MADlib
>          Issue Type: Bug
>          Components: Module: Decision Tree
>            Reporter: Frank McQuillan
>            Assignee: Rahul Iyer
>            Priority: Minor
>             Fix For: v1.11
>
>
> Context 
> Currently in DT/RF if the feature vector contains any NULLs, the whole row 
> will be ignored in the training data.  This is not ideal, especially in the 
> case where training data is sparse.
> Story
> As a data scientist, I want the DT/RF modules to use the non-NULL parts of 
> the feature vector, and not discard the whole row, so that I can get better 
> accuracy for classification/regression in the case of sparse data.
> Acceptance
> TBD



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
