ASF GitHub Bot commented on MADLIB-1168:

GitHub user iyerr3 opened a pull request:


    Balance Sample: Add support for grouping

    JIRA: MADLIB-1168
    This commit adds grouping support for balanced sampling. 
    Grouping is implemented as a loop over the existing logic, 
    with the sampling for each group run independently. 
    Closes #239
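
For readers skimming the description, here is a minimal sketch of what "a loop over the existing logic" could look like. It is not the code in this pull request: the actual module runs SQL through plpy, whereas this is plain in-memory Python. The names `balance_sample` and `balance_sample_grouped`, the column keys `g` and `y`, and the choice of uniform undersampling to the smallest class are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def balance_sample(rows, class_col, rng):
    """Hypothetical stand-in for the existing (ungrouped) logic:
    undersample every class down to the size of the smallest class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[class_col]].append(row)
    # Assumes rows is non-empty; min() raises ValueError otherwise.
    target = min(len(members) for members in by_class.values())
    sampled = []
    for members in by_class.values():
        # Sample without replacement within each class.
        sampled.extend(rng.sample(members, target))
    return sampled


def balance_sample_grouped(rows, class_col, group_col, seed=0):
    """Partition rows by group, then run the ungrouped logic once per
    group so that each group is balanced independently of the others."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_col]].append(row)
    result = []
    for group_rows in by_group.values():
        result.extend(balance_sample(group_rows, class_col, rng))
    return result


# Illustrative data: group "a" is skewed 8:2, group "b" is skewed 3:9.
rows = (
    [{"g": "a", "y": 0}] * 8 + [{"g": "a", "y": 1}] * 2
    + [{"g": "b", "y": 0}] * 3 + [{"g": "b", "y": 1}] * 9
)
balanced = balance_sample_grouped(rows, class_col="y", group_col="g")
```

With this toy input, group "a" keeps 2 rows per class and group "b" keeps 3 rows per class, because each group's target is derived only from its own class counts.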

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/madlib/madlib 

Alternatively you can review and apply these changes as the patch at:


To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #239

commit 6c5fcfb375eaf7dc68e1ede4aca2a47b8e55309b
Author: Rahul Iyer <riyer@...>
Date:   2018-02-24T02:45:32Z

    Clean code + conform to PEP8

commit a5a0c1e2c851a923b9eb550d42dfc594b4635c64
Author: Rahul Iyer <riyer@...>
Date:   2018-02-26T23:01:34Z

    Add a Collate plpy results function

commit 8e8eca2960207ca0317ded68608c660b8d4ddb55
Author: Rahul Iyer <riyer@...>
Date:   2018-03-02T00:44:54Z

    Add grouping in get_level_frequency_distribution

commit cad4a5be732f89504ff62f4d9e68367d174fc322
Author: Rahul Iyer <riyer@...>
Date:   2018-03-07T07:07:00Z

    Ensure subqueries are filtering groups and using right count

commit 39dd6f436bb9b8d505be5204226dcc3053b1b4df
Author: Rahul Iyer <riyer@...>
Date:   2018-03-07T07:07:14Z

    Update install check to include grouping

commit d61ff28290dad27ead0f1c68d740a8ccb79f4aec
Author: Rahul Iyer <riyer@...>
Date:   2018-03-07T07:07:27Z

    Update documentation with grouping examples


> Balance datasets
> ----------------
>                 Key: MADLIB-1168
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1168
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>            Assignee: ssoni
>            Priority: Major
>             Fix For: v1.14
>         Attachments: MADlib Balance Datasets Requirements.pdf, 
> MADlib_Balance_Datasets_Requirements_v2.pdf
> From [1] here is the motivation behind balancing datasets:
> “Most classification algorithms will only perform optimally when the number 
> of samples of each class is roughly the same. Highly skewed datasets, where 
> the minority is heavily outnumbered by one or more classes, have proven to be 
> a challenge while at the same time becoming more and more common.
> One way of addressing this issue is by re-sampling the dataset as to offset 
> this imbalance with the hope of arriving at a more robust and fair decision 
> boundary than you would otherwise.
> Re-sampling techniques can be divided in these categories:
> * Under-sampling the majority class(es).
> * Over-sampling the minority class.
> * Combining over- and under-sampling.
> * Create ensemble balanced sets.”
> There is an extensive literature on balancing datasets.  The plan for MADlib 
> in the initial phase is to offer basic functionality that can be extended in 
> later phases based on feedback from users.  
> Please see attached document for proposed scope of this story.
> References
> [1] imbalanced-learn Python project
> http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
> https://github.com/scikit-learn-contrib/imbalanced-learn

