[GitHub] madlib pull request #230: Balanced sets final

kaknikhil Fri, 02 Feb 2018 12:32:42 -0800

Github user kaknikhil commented on a diff in the pull request:

    https://github.com/apache/madlib/pull/230#discussion_r165505215
  
    --- Diff: src/ports/postgres/modules/sample/balance_sample.sql_in ---
    @@ -0,0 +1,355 @@
    +/* ----------------------------------------------------------------------- 
*//**
    + *
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *   http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied.  See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + *
    + *
    + * @file balance_sample.sql_in
    + *
    + * @brief SQL functions for balanced data sets sampling.
    + * @date 12/14/2017
    + *
    + * @sa Given a table, balanced sampling returns a sampled data set
    + * with specified balancing for each class (defaults to uniform sampleing).
    + *
    + *//* 
----------------------------------------------------------------------- */
    +
    +m4_include(`SQLCommon.m4')
    +
    +
    +/**
    +@addtogroup grp_balance_sampling
    +
    +<div class="toc"><b>Contents</b>
    +<ul>
    +<li><a href="#strs">Balanced Data Sets Sampling</a></li>
    +<li><a href="#examples">Examples</a></li>
    +</ul>
    +</div>
    +
    +@brief A method for independently sampling classes to receive a
    +balanced data set.
    +It is commonly used to where classes have greater imbalanced ratio.
    +It ensures the subclasses are adequately represented in the sample.
    +
    +@anchor strs
    +@par Balanced Sampling
    +
    +<pre class="syntax">
    +balance_sample( source_table,
    +                output_table,
    +                class_col,
    +                class_sizes,
    +                output_table_size,
    +                grouping_cols,
    +                with_replacement,
    +                keep_null
    +              )
    +</pre>
    +
    +\b Arguments
    +<dl class="arglist">
    +<dt>source_table</dt>
    +<dd>TEXT. Name of the table containing the input data.</dd>
    +
    +<dt>output_table</dt>
    +<dd>TEXT. Name of output table that contains the sampled data.
    +The output table contains all columns present in the source
    +table.</dd>
    +
    +<dt>class_col</dt>
    +<dd>TEXT,  Name of the column containing the class to be balanced.
    +</dd>
    +
    +<dt>class_sizes (optional)</dt>
    +<dd>VARCHAR, default âuniformâ.
    +
    +@note
    +Current implementation only supports 'undersample'.
    +
    +Parameter to define the size of the different class values.
    +(Class values are sometimes also called levels).
    +
    +Can take the following forms:
    +
    +<ul>
    +<li><b>âuniformâ</b>:
    +All class values will be resampled to have the same number of rows.
    +</li>
    +<li><b>'undersample'</b>:
    +Under-sample such that all class values end up with the same number of
    +observations as the minority class.  Done without replacement by default
    +unless the parameter âwith_replacementâ is set to TRUE.
    +</li>
    +<li><b>'oversample'</b>:
    +Over-sample with replacement such that all class values end up with the
    +same number of observations as the majority class.  Not affected by the
    +parameter âwith_replacementâ since over-sampling is always done with
    +replacement.
    +</li>
    +</ul>
    +
    +You can also explicitly set class size in a string containing a
    +comma-delimited list. Order does not matter and all class values do not
    +need to be specified.  Use the format âclass_value_1=x, class_value_2=y, 
...â
    +where the class value in the list must exist in the column âclass_colâ.
    +
    +E.g.,  âmale=3000, female=7000â means you want to resample the dataset
    +to result in 3000 male and 7000 female rows in the âoutput_tableâ.
    +
    +@note
    +The allowed names for class values follows object naming rules in
    +PostgreSQL [6].  Quoted identifiers are allowed and should be enclosed
    +in double quotes â in the usual way.  If for some reason the class values
    +in the examples above were âMaLeâ and âFeMaLeâ then the comma 
delimited
    +list for âclass_sizeâ would be:  ââMaLeâ=3000, 
âFeMaLeâ=7000â.
    +</dd>
    +
    +<dt>output_table_size (optional)</dt>
    +<dd>INTEGER, default NULL.  Desired size of the output data set.
    +
    +This parameter is ignored if âclass_sizeâ parameter is set to either
    +âoversampleâ or âundersampleâ since output table size is already 
determined.
    +If NULL, the resulting output table size will depend on the settings for 
the
    +âclass_sizeâ parameter.
    +
    +Example: â3000â will result in 3000 rows in the âouput_tableâ.
    +</dd>
    +
    +<dt>grouping_cols (optional)</dt>
    +<dd>
    +TEXT, default: NULL. A single column or a list of comma-separated columns 
that
    +defines the strata.  When this parameter is NULL, no grouping is used so 
the
    +sampling is non-stratified, that is, the whole table is treated as a single
    +group.
    +
    +@note Current implementation does not support grouping_cols
    +</dd>
    +
    +<dt>with_replacement  (optional)</dt>
    +<dd>
    +BOOLEAN, default FALSE.  Determines whether to sample with replacement or
    +without replacement (default). With replacement means that it is possible 
that
    +the same row may appear in the sample set more than once. Without 
replacement
    +means a given row can be selected only once. This parameter affects
    +under-sampling only since over-sampling is always performed with 
replacement.
    +</dd>
    +
    +<dt>keep_null  (optional)</dt>
    +<dd>
    +BOOLEAN, default FALSE.  Determines whether to sample rows whose class 
values
    +are NULL. By default, all rows with NULL class values are ignored. If this 
is
    +set to TRUE, then NULL is treated as another class value.
    +</dd>
    +
    +</dl>
    +
    +@anchor examples
    +@par Examples
    +
    +Please note that due to the random nature of sampling, your
    +results may look different from those below.
    +
    +-# Create an input table:
    +<pre class="syntax">
    +DROP TABLE IF EXISTS test;
    +CREATE TABLE test(
    +    id1 INTEGER,
    +    id2 INTEGER,
    +    gr1 INTEGER,
    +    gr2 INTEGER
    +);
    +INSERT INTO test VALUES
    +(1,0,1,1),
    +(2,0,1,1),
    +(3,0,1,1),
    +(4,0,1,1),
    +(5,0,1,1),
    +(6,0,1,1),
    +(7,0,1,1),
    +(8,0,1,1),
    +(9,0,1,1),
    +(9,0,1,1),
    +(9,0,1,1),
    +(9,0,1,1),
    +(0,1,1,2),
    +(0,2,1,2),
    +(0,3,1,2),
    +(0,4,1,2),
    +(0,5,1,2),
    +(0,6,1,2),
    +(10,10,2,2),
    +(20,20,2,2),
    +(30,30,2,2),
    +(40,40,2,2),
    +(50,50,2,2),
    +(60,60,2,2),
    +(70,70,2,2);
    +</pre>
    +
    +-# Sample without replacement:
    +<pre class="syntax">
    +DROP TABLE IF EXISTS out;
    +SELECT madlib.balance_sample(
    +                              'test',     -- Source table
    +                              'out',      -- Output table
    +                              'gr1');     -- Class column
    --- End diff --
    
    can we rename all occurrences of `gr1` since it can be confused as a 
grouping column as well.

---

[GitHub] madlib pull request #230: Balanced sets final

Reply via email to