[jira] [Comment Edited] (MADLIB-1168) Balance datasets

Frank McQuillan (JIRA) Wed, 20 Dec 2017 10:42:26 -0800

    [ 
https://issues.apache.org/jira/browse/MADLIB-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298881#comment-16298881
 ]


Frank McQuillan edited comment on MADLIB-1168 at 12/20/17 6:41 PM:
-------------------------------------------------------------------

[~ssoni] 
The answer to your question depends on how the parameter 'output_table_size' 
is set.

The rest of the classes are either left as is or resampled uniformly:
* If 'output_table_size' = NULL, the rest of the classes are left as is.
* If 'output_table_size' = a number, the rest of the classes are sampled 
evenly, as described in the requirements doc.
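
To make the two cases concrete, here is a rough Python sketch of how the 
per-class target sizes could be computed. It is only an illustration, not the 
MADlib implementation: the function name, the dictionary inputs, and the rule 
for handing out leftover rows from integer division are all assumptions, and 
the requirements doc remains the authority on the exact behavior.

```python
from typing import Dict, Optional


def target_class_sizes(
    counts: Dict[str, int],
    requested: Dict[str, int],
    output_table_size: Optional[int] = None,
) -> Dict[str, int]:
    """Return the number of rows to sample for each class (hypothetical helper).

    counts:    rows per class in the source table
    requested: explicit sizes for the classes the user named
    """
    rest = [c for c in counts if c not in requested]
    targets = dict(requested)

    if output_table_size is None:
        # NULL case: classes that were not named keep their original size.
        for c in rest:
            targets[c] = counts[c]
        return targets

    # Numeric case: rows not claimed by the named classes are split evenly
    # across the rest of the classes.
    remaining = output_table_size - sum(requested.values())
    if remaining < 0:
        raise ValueError("requested class sizes exceed output_table_size")
    if rest:
        share, extra = divmod(remaining, len(rest))
        # Assumption: leftover rows go to the first classes, one row each.
        for i, c in enumerate(rest):
            targets[c] = share + (1 if i < extra else 0)
    return targets


counts = {"a": 100, "b": 20, "c": 50}
print(target_class_sizes(counts, {"b": 30}))       # 'a' and 'c' left as is
print(target_class_sizes(counts, {"b": 30}, 100))  # 'a' and 'c' get 35 rows each
```
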

In the table on page 7, at the end of the 'Interface' section, please see rows 
7-11, which describe how to handle the rest of the classes:
https://issues.apache.org/jira/secure/attachment/12900943/MADlib_Balance_Datasets_Requirements_v2.pdf

Let me know if this is not clear or if you have a suggestion for an alternative 
approach.

Frank







> Balance datasets
> ----------------
>
>                 Key: MADLIB-1168
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1168
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>            Assignee: ssoni
>             Fix For: v1.14
>
>         Attachments: MADlib Balance Datasets Requirements.pdf, 
> MADlib_Balance_Datasets_Requirements_v2.pdf
>
>
> From [1] here is the motivation behind balancing datasets:
> “Most classification algorithms will only perform optimally when the number 
> of samples of each class is roughly the same. Highly skewed datasets, where 
> the minority is heavily outnumbered by one or more classes, have proven to be 
> a challenge while at the same time becoming more and more common.
> One way of addressing this issue is by re-sampling the dataset as to offset 
> this imbalance with the hope of arriving at a more robust and fair decision 
> boundary than you would otherwise.
> Re-sampling techniques can be divided in these categories:
> * Under-sampling the majority class(es).
> * Over-sampling the minority class.
> * Combining over- and under-sampling.
> * Create ensemble balanced sets.”
> There is an extensive literature on balancing datasets.  The plan for MADlib 
> in the initial phase is to offer basic functionality that can be extended in 
> later phases based on feedback from users.  
> Please see attached document for proposed scope of this story.
> References
> [1] imbalanced-learn Python project
> http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
> https://github.com/scikit-learn-contrib/imbalanced-learn



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
