Frank McQuillan created MADLIB-1266:
---------------------------------------

             Summary: General fit function for PL/Python
                 Key: MADLIB-1266
                 URL: https://issues.apache.org/jira/browse/MADLIB-1266
             Project: Apache MADlib
          Issue Type: New Feature
          Components: Module: Utilities
            Reporter: Frank McQuillan


Story

`As a data scientist`
I want to call a generic PL/Python UDF from SQL to fit a model
`so that`
I can use any code I write, or any Python library, for model building.

Interface

{code}
fit(
    source_table,                -- source table
    model_table,                 -- model output table
    list_of_columns,             -- columns you want in GD, could be '*'
    list_of_columns_to_exclude,  -- columns to explicitly exclude
    fit_udf,                     -- plpython UDF to fit model
    fit_udf_parameters,          -- parameters for UDF, if any
    grouping_cols                -- groups to build separate models for
                                 -- (source table distributed by this grouping)
);
{code}
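
The per-group behavior this interface implies can be sketched in plain Python. This is only an illustration under stated assumptions, not the proposed implementation: `fit_by_group` and `mean_model` are hypothetical names, rows are modeled as dictionaries rather than table rows, and `fit_udf_parameters` is passed as a dictionary although the issue specifies TEXT.

```python
from collections import defaultdict
from statistics import mean


def fit_by_group(rows, fit_udf, fit_udf_parameters=None, grouping_cols=None):
    """Hypothetical driver: call fit_udf once per group, one model per group."""
    params = fit_udf_parameters or {}
    groups = defaultdict(list)
    for row in rows:
        # With no grouping columns, every row falls into the single group ().
        key = tuple(row[c] for c in grouping_cols) if grouping_cols else ()
        groups[key].append(row)
    return {key: fit_udf(group_rows, **params) for key, group_rows in groups.items()}


def mean_model(rows, target):
    """Toy 'model' standing in for a real library fit: the mean of one column."""
    return {"mean": mean(r[target] for r in rows)}


rows = [
    {"region": "east", "y": 1.0},
    {"region": "east", "y": 3.0},
    {"region": "west", "y": 10.0},
]
models = fit_by_group(rows, mean_model, {"target": "y"}, ["region"])
```

Here `models` holds one entry per distinct `region`, mirroring the "one row per group" layout described for `model_table`.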

Arguments
{code}
source_table
TEXT. Name of the table containing the data to load.

model_table
TEXT. Name of the table containing the model(s), with one row per group.

list_of_columns
TEXT. Comma-separated string of column names or expressions to load.
Can also be '*', implying all columns are to be loaded (except for the ones
included in the next argument that lists exclusions). The types of the columns
can be mixed. Array columns can also be included in the list and will be loaded
as is (i.e., not flattened). (???)

list_of_columns_to_exclude
TEXT. Comma-separated string of column names to exclude from load. 
Typically used when 'list_of_columns' is set to '*'.

fit_udf
TEXT. PL/Python UDF to fit the model.

fit_udf_parameters (optional)
TEXT. Parameters for the UDF, if any.

grouping_cols (optional)
TEXT, default: NULL. Comma-separated list of column names to group the data by. 
This will produce multiple models, one for each group.
{code}
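
One possible reading of how `list_of_columns` and `list_of_columns_to_exclude` interact is sketched below. `resolve_columns` is a hypothetical helper, not part of MADlib; it assumes the table's column names are already known, and it splits on commas naively, so expressions that themselves contain commas would need a real parser.

```python
def resolve_columns(table_columns, list_of_columns, list_of_columns_to_exclude=None):
    """Hypothetical helper: expand '*' and drop excluded names, keeping order."""
    excluded = {
        c.strip()
        for c in (list_of_columns_to_exclude or "").split(",")
        if c.strip()
    }
    if list_of_columns.strip() == "*":
        requested = list(table_columns)
    else:
        # Naive split: assumes no commas inside individual expressions.
        requested = [c.strip() for c in list_of_columns.split(",") if c.strip()]
    # Assumption: exclusions apply even when an explicit list is given.
    return [c for c in requested if c not in excluded]


all_but_id = resolve_columns(["id", "x1", "x2", "y"], "*", "id")
explicit = resolve_columns(["id", "x1", "x2", "y"], "x1, y")
```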


Open questions

1) Do we need separate fit functions for R and Python, or can we autodetect?
If we need separate ones, we could call this module `fit_plpythonu` and the R one
would be `fit_plr`.


Notes

1) Both Keras and scikit-learn use the term `fit`, which seems better than `train`.
(We will use the term `predict` for prediction in a separate story.)


Acceptance

1) Generate a model table for a sample data set with multiple groups using a
scikit-learn model.
2) Repeat for Keras/TF.
3) Repeat for XGBoost.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
