[
https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129235#comment-16129235
]
Frank McQuillan commented on MADLIB-1119:
-----------------------------------------
In terms of user docs, may I suggest:
1) rename “Model Evaluation” to “Model Selection”
2) change the description from
“Contains functions for evaluating accuracy and validation of predictive
methods.”
to
“Contains functions for model selection and model evaluation.”
3) put “Train-Test Split” under “Model Selection” and not under “Sampling”
> Train-test split
> ----------------
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Fix For: v1.12
>
>
> Context
> See related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets,
> including grouping support, so that I can use the resulting sets for model
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate
> tables
> Proposed Interface
> {code}
> train_test_split (
> source_table,
> output_table,
> train_proportion,
> test_proportion, -- optional
> grouping_col, -- optional
> with_replacement, -- optional
> target_cols, -- optional
> separate_output_tables -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table. A new INTEGER column on the right
> called 'split' will identify 1 for train set and 0 for test set,
> unless the 'separate_output_tables' parameter below is TRUE,
> in which case two output tables will be created using
> the 'output_table' name with the suffixes '_train' and '_test'.
> The output table contains all the columns present in the source
> table unless otherwise specified in the 'target_cols' parameter below.
> train_proportion
> FLOAT8 in the range (0,1). Proportion of the dataset to include
> in the train split. If the 'grouping_col' parameter is specified below,
> each group will be sampled independently using the
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1). Proportion of the dataset to include
> in the test split. Default is the complement to the train
> proportion (1-'train_proportion'). If the 'grouping_col'
> parameter is specified below, each group will be sampled
> independently using the test proportion,
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
> that defines how to stratify. When this parameter is NULL,
> the train-test split is not stratified.
> with_replacement (optional)
> BOOLEAN, default FALSE. Determines whether to sample with replacement
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the
> 'output_table'.
> If NULL, all columns from the 'source_table' will appear in the
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE. If TRUE, two output tables will be created using
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
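To make the proposed semantics concrete, here is a minimal in-memory sketch in Python. It is not the MADlib implementation (which would run in-database); the function body, the list-of-dicts row representation, the rounding rule, and the 'seed' argument are assumptions for illustration only. It covers sampling without replacement, an optional single grouping column, and the single-output form with a 'split' column (1 = train, 0 = test).

```python
import random
from collections import defaultdict


def train_test_split(rows, train_proportion, test_proportion=None,
                     grouping_col=None, seed=None):
    """Illustrative sketch only: rows is a list of dicts.

    Returns copies of the sampled rows with an extra 'split' key,
    1 for the train set and 0 for the test set.
    """
    if test_proportion is None:
        # Default is the complement of the train proportion.
        test_proportion = 1.0 - train_proportion

    rng = random.Random(seed)  # 'seed' is a hypothetical convenience argument

    # Bucket rows by group; with no grouping column there is a single bucket.
    groups = defaultdict(list)
    for row in rows:
        key = row[grouping_col] if grouping_col is not None else None
        groups[key].append(row)

    out = []
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Each group is sampled independently (stratified). Rounding to the
        # nearest integer is an assumption; the story does not specify it.
        n_train = int(round(train_proportion * len(shuffled)))
        n_test = int(round(test_proportion * len(shuffled)))
        n_test = min(n_test, len(shuffled) - n_train)
        # Disjoint slices of one shuffle: no row lands in both sets.
        for row in shuffled[:n_train]:
            out.append({**row, "split": 1})
        for row in shuffled[n_train:n_train + n_test]:
            out.append({**row, "split": 0})
    return out
```

Because train and test rows come from disjoint slices of the same shuffled list, a row cannot appear in both sets. Sampling with replacement and the 'target_cols' and 'separate_output_tables' options are deliberately left out of this sketch.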
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [1].
>
> 2) From Rahul Iyer: "The goal of having both train and test is to provide
> subsample and train/test split in one function.
> For example, if train_size = 0.4 and test_size = 0.1, then only half of the
> input data will be output. This is tremendously useful in situations where a
> user wants to prototype/evaluate a couple of models on smaller iid data before
> running them on the whole dataset.
> Under no circumstances would the train_size + test_size be allowed to be more
> than 1. The implementation will also ensure that there are no "leaks" (leak =
> same data occurring in both train and test) as that defeats the whole purpose
> of building an independent dataset for model evaluation.
> Of course, the interface does get a little complex and could confuse users.
> Explanatory documentation with examples is the only solution to that problem.
> The alternative to having both sizes in one function is to run a subsample
> function (using various sampling methods) and then perform the train_test
> split. The downside to this approach is it requires writing an intermediate
> table to disk (inefficient)."
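The constraints in the note above can be sketched as two small checks in Python. The helper names 'resolve_proportions' and 'has_leak' are hypothetical and not part of the proposed interface; the open (0,1) range is taken from the parameter descriptions, and the small floating-point tolerance is an assumption.

```python
def resolve_proportions(train_proportion, test_proportion=None):
    """Validate the two proportions and fill in the default test proportion."""
    if not 0.0 < train_proportion < 1.0:
        raise ValueError("train_proportion must be in the range (0,1)")
    if test_proportion is None:
        # Default: the complement of the train proportion.
        test_proportion = 1.0 - train_proportion
    elif not 0.0 < test_proportion < 1.0:
        raise ValueError("test_proportion must be in the range (0,1)")
    # The two proportions may sum to less than 1 (a subsample), never more.
    if train_proportion + test_proportion > 1.0 + 1e-12:
        raise ValueError("train_proportion + test_proportion must not exceed 1")
    return train_proportion, test_proportion


def has_leak(train_ids, test_ids):
    """Return True if any row identifier occurs in both train and test."""
    return not set(train_ids).isdisjoint(test_ids)
```

With a train proportion of 0.4 and a test proportion of 0.1, the check passes and the two sets together cover half of the input, which is the subsampling case described in the note. A pair such as 0.7 and 0.5 is rejected.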
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar green for all supported dbs.
> References
> [1] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html
> [2] Related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> [3] General
> https://en.wikipedia.org/wiki/Test_set
> [4] scikit-learn
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html