[ https://issues.apache.org/jira/browse/MADLIB-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Orhan Kislal updated MADLIB-1266:
---------------------------------
    Fix Version/s: v3.0
                       (was: v2.1)

> General fit function for PL/Python
> ----------------------------------
>
>                 Key: MADLIB-1266
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1266
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Utilities
>            Reporter: Frank McQuillan
>            Priority: Major
>             Fix For: v3.0
>
>
> Story
> `As a data scientist`
> I want to call a generic PL/Python UDF from SQL to fit a model
> `so that`
> I can use any code I write, or Python libraries, for model building.
>
> Interface
> {code}
> fit(
>     source_table,                -- source table
>     model_table,                 -- model output table
>     list_of_columns,             -- columns you want in GD, could be '*'
>     list_of_columns_to_exclude,  -- columns to explicitly exclude
>     fit_udf,                     -- plpython UDF to fit model
>     fit_udf_parameters,          -- parameters for UDF, if any
>     grouping_cols                -- groups to build separate models for
>                                  -- (source table distributed by this grouping)
> );
> {code}
>
> Arguments
> {code}
> source_table
> TEXT. Name of the table containing the data to load.
>
> model_table
> TEXT. Name of the table containing the model(s), with one row per group.
>
> list_of_columns
> TEXT. Comma-separated string of column names or expressions to load.
> Can also be '*', implying all columns are to be loaded (except for the
> ones included in the next argument that lists exclusions). The types of
> the columns can be mixed. Array columns can also be included in the list
> and will be loaded as is (i.e., not flattened). (???)
>
> list_of_columns_to_exclude
> TEXT. Comma-separated string of column names to exclude from load.
> Typically used when 'list_of_columns' is set to '*'.
>
> fit_udf
> TEXT. plpython UDF to fit model.
>
> fit_udf_parameters (optional)
> TEXT. Parameters for UDF, if any.
>
> grouping_cols (optional)
> TEXT, default: NULL. Comma-separated list of column names to group the
> data by. This will produce multiple models, one for each group.
> {code}
>
> Open questions
> 1) Do we need separate fit functions for R and Python, or can we autodetect?
> If we need separate ones, we could call this module `fit_plpythonu` and the
> R one would be `fit_plr`.
>
> Notes
> 1) Both Keras & scikit-learn use the term `fit`, which seems better than
> `train`.
> (We will use the term `predict` for prediction in a separate story.)
>
> Acceptance
> 1) Generate a model table for a sample data set with multiple groups using a
> scikit-learn model.
> 2) Repeat for Keras/TF.
> 3) Repeat for XGBoost.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
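
The grouping and column-selection behaviour described in the Arguments section can be sketched in plain Python. This is only an illustration under stated assumptions, not the MADlib implementation: rows are modelled as a list of dicts rather than a database table, `fit_udf` is a Python callable rather than a PL/Python UDF name, `fit_udf_parameters` is a dict rather than TEXT, and only plain column names (no expressions) are handled. The helper `select_columns` and the toy `mean_model` are hypothetical names introduced here.

```python
from collections import defaultdict


def select_columns(all_columns, list_of_columns, list_of_columns_to_exclude=None):
    """Resolve the columns to load; '*' means all columns minus the exclusions.

    Assumption: plain comma-separated column names only (no expressions).
    """
    excluded = {c.strip() for c in (list_of_columns_to_exclude or "").split(",") if c.strip()}
    if list_of_columns.strip() == "*":
        requested = list(all_columns)
    else:
        requested = [c.strip() for c in list_of_columns.split(",") if c.strip()]
    return [c for c in requested if c not in excluded]


def fit(rows, list_of_columns, fit_udf, list_of_columns_to_exclude=None,
        fit_udf_parameters=None, grouping_cols=None):
    """Return one model row per group, mimicking the proposed model table."""
    group_names = [c.strip() for c in (grouping_cols or "").split(",") if c.strip()]
    all_columns = list(rows[0].keys()) if rows else []
    columns = select_columns(all_columns, list_of_columns, list_of_columns_to_exclude)

    # Bucket the selected columns of each row by its grouping key.
    # With no grouping columns, every row shares the empty key ().
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[g] for g in group_names)
        groups[key].append({c: row[c] for c in columns})

    # Call the user-supplied fit function once per group.
    model_table = []
    for key, group_rows in groups.items():
        model_row = dict(zip(group_names, key))
        model_row["model"] = fit_udf(group_rows, **(fit_udf_parameters or {}))
        model_table.append(model_row)
    return model_table


def mean_model(group_rows, target):
    """Toy stand-in for a real fit UDF: the 'model' is the mean of one column."""
    values = [r[target] for r in group_rows]
    return {"mean": sum(values) / len(values)}


sample = [
    {"id": 1, "region": "east", "y": 2.0},
    {"id": 2, "region": "east", "y": 4.0},
    {"id": 3, "region": "west", "y": 10.0},
]
models = fit(sample, "*", mean_model, list_of_columns_to_exclude="id",
             fit_udf_parameters={"target": "y"}, grouping_cols="region")
```

Here `models` holds one entry per `region` value, which mirrors the "one row per group" model table in the proposal. A real `fit_udf` would call a library such as scikit-learn inside the callable and return a serialized model instead of a dict.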