[ 
https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank McQuillan updated MADLIB-986:
-----------------------------------
    Description: 
Story

As a data scientist, I want to sample a data table in proportion to the number 
of rows in each group, so that I can do model building on the sampled data sets.

The MVP for this story is:
* the sample proportion is global, i.e., a single fractional value between 0 and 1 
that applies to every group
* option to sample without replacement (default) or with replacement
* option to write only a subset of the source columns to the output table
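
To make "in proportion to the number of rows in each group" concrete, here is a 
minimal Python sketch of how per-group sample sizes could follow from one global 
proportion. The helper name stratum_sample_sizes and the rounding rule (nearest 
integer, at least one row per group) are assumptions for illustration; the story 
does not specify how fractional sizes are handled.

```python
from collections import Counter


def stratum_sample_sizes(group_keys, proportion):
    """Map each group to its sample size under one global proportion.

    Assumption (not stated in the story): sizes are rounded to the nearest
    integer and every non-empty group contributes at least one row.
    """
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    counts = Counter(group_keys)
    return {group: max(1, round(n * proportion)) for group, n in counts.items()}


# Three groups with 100, 50, and 10 rows, sampled at a global proportion of 0.1.
sizes = stratum_sample_sizes(["a"] * 100 + ["b"] * 50 + ["c"] * 10, 0.1)
```

With these inputs the groups contribute 10, 5, and 1 rows respectively, so the 
sample keeps the 100:50:10 ratio of the source data.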

Proposed Interface
{code}
stratified_sample(
    source_table,
    output_table,
    proportion,
    grouping_col,      -- optional
    with_replacement,  -- optional
    target_cols        -- optional
)

source_table
TEXT. The name of the table containing the input data.

output_table
TEXT. Name of output table that contains the sampled data. 
The output table contains all the columns present in the source table 
unless otherwise specified in the 'target_cols' parameter below.

proportion
FLOAT8 in the range (0,1).  The fraction of rows to sample from each stratum, 
so that each stratum's sample size is proportional to the size of the stratum. 

grouping_col (optional)
TEXT, default: NULL. A single column or a comma-separated list of columns 
that defines the strata.  When this parameter is NULL, no grouping is used, 
so the sampling is non-stratified.

with_replacement (optional) 
BOOLEAN, default: FALSE.  If TRUE, sample with replacement; if FALSE, 
sample without replacement.

target_cols (optional)
TEXT, default: NULL. A comma-separated list of columns to appear in the 
'output_table'. 
If NULL, all columns from the 'source_table' appear in the 'output_table'.
{code}
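
The semantics described by the parameters above can be sketched in plain Python. 
This is an illustrative model only, not MADlib code: it works on an in-memory 
list of dict rows instead of tables, takes 'grouping_col' and 'target_cols' as 
Python lists rather than comma-separated TEXT, and adds a hypothetical 'seed' 
argument for repeatability. The rounding rule (nearest integer, at least one row 
per stratum) is likewise an assumption.

```python
import random
from collections import defaultdict


def stratified_sample(rows, proportion, grouping_col=None,
                      with_replacement=False, target_cols=None, seed=None):
    """Sample a list of dict rows per stratum (illustrative sketch only)."""
    if not 0 < proportion < 1:
        raise ValueError("proportion must be in the open interval (0, 1)")
    rng = random.Random(seed)

    # Bucket rows by stratum; with no grouping columns all rows share one key,
    # which makes the sampling non-stratified.
    strata = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in grouping_col) if grouping_col else ()
        strata[key].append(row)

    sampled = []
    for members in strata.values():
        # Assumed rounding rule: nearest integer, at least one row per stratum.
        k = max(1, round(len(members) * proportion))
        if with_replacement:
            sampled.extend(rng.choices(members, k=k))  # rows may repeat
        else:
            sampled.extend(rng.sample(members, k))     # rows are distinct

    # Keep only target_cols when given, otherwise return every column.
    if target_cols:
        sampled = [{c: row[c] for c in target_cols} for row in sampled]
    return sampled


# 80 rows in group "a" and 20 rows in group "b", sampled at 25%.
rows = [{"id": i, "grp": "a" if i < 80 else "b", "val": i * 2}
        for i in range(100)]
out = stratified_sample(rows, 0.25, grouping_col=["grp"],
                        target_cols=["id", "grp"], seed=1)
```

Sampling each stratum separately is what keeps the group ratio of the source 
data in the output; a single draw over the whole table would preserve it only 
on average.
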
Other notes

PDL tools is one example implementation of stratified sampling to review [2].  

Please review existing MADlib sample functions [3] to see if these can be used 
as a basis, or built on, for this stratified sample story. 

References

[2] PDL tools sampling modules, including stratified sampling
http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html

[3] Existing MADlib sample function
http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html

[4] Pandas/Selecting Random Samples
http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples

[5] General
https://en.wikipedia.org/wiki/Stratified_sampling

> Stratified sampling
> -------------------
>
>                 Key: MADLIB-986
>                 URL: https://issues.apache.org/jira/browse/MADLIB-986
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>              Labels: starter
>             Fix For: v1.12



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
