[ https://issues.apache.org/jira/browse/MADLIB-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135498#comment-16135498 ]
Frank McQuillan commented on MADLIB-1119:
-----------------------------------------
3) The user doc changes that I attached above have not been applied
4) Left panel of the user docs is still called "Test Train Split" but should be
called "Train-Test Split"
Also in the main page file:
* v1.11 missing from http://madlib.incubator.apache.org/docs/latest/
* v1.10.0 link does not work
5) Are the train and test proportions reversed in the case where
`separate_output_tables = FALSE`, or are the user docs wrong?
6) With the same proportions for train and test and no stratification, why
are the counts different (i.e., 2 vs. 3)?
{code}
DROP TABLE IF EXISTS out;
SELECT madlib.train_test_split(
'test', -- Source table
'out', -- Output table
0.1, -- Train proportion
0.1, -- Test proportion
NULL, -- Strata definition
'id1,id2', -- Columns to output
FALSE, -- Sample without replacement
FALSE); -- No separate output tables (single table with 'split' column)
SELECT * FROM out;
{code}
produces
{code}
id1 | id2 | split
-----+-----+-------
70 | 70 | 0
10 | 10 | 1
60 | 60 | 1
9 | 0 | 0
9 | 0 | 0
{code}
7) Please do some detailed testing on the functionality of this module.
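
To make the question in item 6 concrete, the rows printed above can be tallied per 'split' value. The sketch below is in Python rather than SQL and simply hard-codes the five rows from the output. The function and variable names are illustrative only and are not part of MADlib.

```python
from collections import Counter

# Rows copied from the query output above: (id1, id2, split).
rows = [
    (70, 70, 0),
    (10, 10, 1),
    (60, 60, 1),
    (9, 0, 0),
    (9, 0, 0),
]

def tally_by_split(rows):
    """Count rows per value of the trailing 'split' column."""
    return Counter(split for _, _, split in rows)

def repeated_rows(rows):
    """Return the (id1, id2) pairs that appear more than once."""
    seen = Counter((id1, id2) for id1, id2, _ in rows)
    return sorted(key for key, n in seen.items() if n > 1)

counts = tally_by_split(rows)
print(counts[1], counts[0])   # rows labelled 1, rows labelled 0
print(repeated_rows(rows))    # id pairs that show up more than once
```

The tally gives 2 rows labelled 1 and 3 rows labelled 0, which is the mismatch raised in item 6. The pair (9, 0) also appears twice. Whether that reflects duplicate rows in the 'test' source table cannot be told from this output alone, since the source table is not shown.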
> Train-test split
> ----------------
>
> Key: MADLIB-1119
> URL: https://issues.apache.org/jira/browse/MADLIB-1119
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Assignee: Orhan Kislal
> Fix For: v1.12
>
> Attachments: test_train_split.sql_in
>
>
> Context
> See related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> Story
> As a data scientist, I want to split a data table into training and test sets
> including grouping support, so that I use the result sets for model
> development in the usual way.
> The MVP for this story is:
> * support split by group
> * allow option to sample without replacement (default) and sample with
> replacement
> * allow option to output a subset of columns to the output table
> * output one table with a new test/train column, or optionally two separate
> tables
> Proposed Interface
> {code}
> train_test_split (
> source_table,
> output_table,
> train_proportion,
> test_proportion, -- optional
> grouping_col, -- optional
> with_replacement, -- optional
> target_cols, -- optional
> separate_output_tables -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table. A new INTEGER column on the right
> called 'split' will identify 1 for train set and 0 for test set,
> unless the 'separate_output_tables' parameter below is TRUE,
> in which case two output tables will be created using
> the 'output_table' name with the suffixes '_train' and '_test'.
> The output table contains all the columns present in the source
> table unless otherwise specified in the 'target_cols' parameter below.
> train_proportion
> FLOAT8 in the range (0,1). Proportion of the dataset to include
> in the train split. If the 'grouping_col' parameter is specified below,
> each group will be sampled independently using the
> train proportion, i.e., in a stratified fashion.
> test_proportion (optional)
> FLOAT8 in the range (0,1). Proportion of the dataset to include
> in the test split. Default is the complement to the train
> proportion (1-'train_proportion'). If the 'grouping_col'
> parameter is specified below, each group will be sampled
> independently using the test proportion,
> i.e., in a stratified fashion.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
> that defines how to stratify. When this parameter is NULL,
> the train-test split is not stratified.
> with_replacement (optional)
> BOOLEAN, default FALSE. Determines whether to sample with replacement
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the
> 'output_table'.
> If NULL, all columns from the 'source_table' will appear in the
> 'output_table'.
> separate_output_tables (optional)
> BOOLEAN, default FALSE. If TRUE, two output tables will be created using
> the 'output_table' name with the suffixes '_train' and '_test'.
> {code}
> Other notes
> 1) PDL tools is one example implementation of train/test split to review [1].
>
> 2) From Rahul Iyer: "The goal of having both train and test is to provide
> subsample and train/test split in one function.
> For example, if train_size = 0.4 and test_size = 0.1, then only half the input
> data will be output. This is tremendously useful in situations where a user
> wants to prototype/evaluate a couple of models on smaller iid data before
> running it on the whole dataset.
> Under no circumstances would the train_size + test_size be allowed to be more
> than 1. The implementation will also ensure that there are no "leaks" (leak =
> same data occurring in both train and test) as that defeats the whole purpose
> of building an independent dataset for model evaluation.
> Of course, the interface does get a little complex and could confuse users.
> Explanatory documentation with examples is the only solution to that problem.
> The alternative to having both sizes in one function is to run a subsample
> function (using various sampling methods) and then perform the train_test
> split. The downside to this approach is it requires writing an intermediate
> table to disk (inefficient)."
> Acceptance
> 1) Code, user docs, on-line docs, IC, Tinc tests complete.
> 2) Radar green for all supported dbs.
> References
> [1] PDL tools sampling modules incl stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__train__test__split.html
> [2] Related story on stratified sampling
> https://issues.apache.org/jira/browse/MADLIB-986
> [3] General
> https://en.wikipedia.org/wiki/Test_set
> [4] scikit-learn
> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
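
The semantics described in the quoted interface and in Rahul Iyer's note (a default test proportion of 1 - train_proportion, a combined proportion that may not exceed 1, no overlap between train and test, and optional per-group sampling) can be sketched as follows. This is a minimal in-memory Python illustration under stated assumptions, not MADlib's implementation. The function name split_train_test, the group_key and seed arguments, and the choice to round group sizes down are all hypothetical.

```python
import math
import random
from collections import defaultdict

def split_train_test(rows, train_proportion, test_proportion=None,
                     group_key=None, seed=None):
    """Return (train, test), sampled without replacement and without overlap.

    group_key, if given, maps a row to its stratum; each stratum is then
    sampled independently with the same proportions.
    """
    if test_proportion is None:
        test_proportion = 1.0 - train_proportion
    if not (0.0 < train_proportion < 1.0 and 0.0 < test_proportion < 1.0):
        raise ValueError("proportions must lie in (0, 1)")
    # Small tolerance so that p + (1 - p) is not rejected by float rounding.
    if train_proportion + test_proportion > 1.0 + 1e-12:
        raise ValueError("train_proportion + test_proportion must not exceed 1")

    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[group_key(row) if group_key else None].append(index)

    train, test = [], []
    for indices in groups.values():
        rng.shuffle(indices)
        # Assumption: sizes are rounded down; the source does not say how
        # fractional row counts should be handled.
        n_train = math.floor(len(indices) * train_proportion)
        n_test = math.floor(len(indices) * test_proportion)
        # Disjoint slices of one shuffled list, so no row lands in both sets.
        train.extend(rows[i] for i in indices[:n_train])
        test.extend(rows[i] for i in indices[n_train:n_train + n_test])
    return train, test

rows = [(i, i % 2) for i in range(100)]
train, test = split_train_test(rows, 0.4, 0.1, seed=0)
print(len(train), len(test), len(set(train) & set(test)))
```

With train_proportion = 0.4 and test_proportion = 0.1 on 100 rows, this yields 40 training rows, 10 test rows and an empty intersection, matching the "only half the data is output" case in the note. Taking both sets from one shuffled index list is what rules out leaks without needing an intermediate table.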