[ 
https://issues.apache.org/jira/browse/MADLIB-1168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299215#comment-16299215
 ] 

ssoni commented on MADLIB-1168:
-------------------------------

[~iyerr3][[~okislal]]
Following are a couple of potential solutions for limiting the class sizes and 
speeding random sampling query:

* *Solution 1:* Generate random row numbers and select the desired 
fraction/number of observations. This requires just 1 seq scan and may use disk 
merge sort if the desired number of observations is large. Otherwise, it uses 
quicksort.

* *Solution 2:* Use 'order by random()' with the keyword limit. This requires 
as many seq scans as the desired fraction/number specified in class_sizes. It 
is more likely to use top-n heapsort (smaller limit e.g. 1000) instead of disk 
merge sort (larger limit e.g. 100000).

According to our experiment with 10 million rows and 2 columns (1 column with 4 
groups)  both the solutions are comparable. But for smaller datasets, solution 
2 is slightly faster than solution 1.

{code}
drop table test2 cascade;
create table test2 as 
    select generate_series(1,10000000) as id, 
    ((random()*3)+1)::int as gr;
{code} 

*Solution 1:* 

Case with larger subset selection: 19-20 seconds

{code}
explain analyze
            select
                id, gr
            from
                (
                SELECT
                    *,
                    row_number() OVER(PARTITION BY gr)
                    AS __row_no
                FROM
                    test2
                ) as foo
            where
            (gr= '1' and __row_no<=100000 )
            OR
            (gr= '2' and __row_no<=200000 )
            OR
            (gr= '3' and __row_no<=300000 )
            or
            (gr not in ('1','2','3'))
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on foo  (cost=1580360.76..2042859.70 rows=3783218 width=8) 
(actual time=9186.275..19784.115 rows=2267855 loops=1)
   Filter: (((foo.gr = 1) AND (foo.__row_no <= 100000)) OR ((foo.gr = 2) AND 
(foo.__row_no <= 200000)) OR ((foo.gr = 3) AND (foo.__row_no <= 300000)) OR 
(foo.gr <> ALL ('{1,2,3}'::integer[])))
   Rows Removed by Filter: 7732145
   ->  WindowAgg  (cost=1580360.76..1755360.36 rows=9999977 width=16) (actual 
time=9186.273..16999.496 rows=10000000 loops=1)
         ->  Sort  (cost=1580360.76..1605360.71 rows=9999977 width=8) (actual 
time=9186.265..11228.217 rows=10000000 loops=1)
               Sort Key: test2.gr
               Sort Method: external merge  Disk: 176176kB
               ->  Seq Scan on test2  (cost=0.00..144247.77 rows=9999977 
width=8) (actual time=0.017..2289.247 rows=10000000 loops=1)
 Planning time: 0.210 ms
 Execution time: 20010.551 ms
(10 rows)

Time: 20011.532 ms (00:20.012)
{code}

Case with smaller subset selection: 19-20 seconds

{code}
explain analyze
            select
                id, gr
            from
                (
                SELECT
                    *,
                    row_number() OVER(PARTITION BY gr)
                    AS __row_no
                FROM
                    test2
                ) as foo
            where
            (gr= '1' and __row_no<=1000 )
            OR
            (gr= '2' and __row_no<=2000 )
            OR
            (gr= '3' and __row_no<=3000 )
            or
            (gr not in ('1','2','3'))
                                                                                
        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on foo  (cost=1580360.76..2042859.70 rows=3783218 width=8) 
(actual time=8699.221..19181.321 rows=1673855 loops=1)
   Filter: (((foo.gr = 1) AND (foo.__row_no <= 1000)) OR ((foo.gr = 2) AND 
(foo.__row_no <= 2000)) OR ((foo.gr = 3) AND (foo.__row_no <= 3000)) OR (foo.gr 
<> ALL ('{1,2,3}'::integer[])))
   Rows Removed by Filter: 8326145
   ->  WindowAgg  (cost=1580360.76..1755360.36 rows=9999977 width=16) (actual 
time=8699.218..16369.833 rows=10000000 loops=1)
         ->  Sort  (cost=1580360.76..1605360.71 rows=9999977 width=8) (actual 
time=8699.210..10701.751 rows=10000000 loops=1)
               Sort Key: test2.gr
               Sort Method: external merge  Disk: 176176kB
               ->  Seq Scan on test2  (cost=0.00..144247.77 rows=9999977 
width=8) (actual time=0.017..2113.890 rows=10000000 loops=1)
 Planning time: 0.181 ms
 Execution time: 19471.310 ms
(10 rows)
{code}

*Solution 2*.
Case with larger subset selection: 19-20 seconds

{code}
explain analyze
select * from
(select * from test2 where gr = '1' order by random() limit 100000) as foo
UnION
(select * from test2 where gr= '2'  order by random()  limit 200000)
UNion
(select * from test2 where gr= '3'  order by random()  limit 300000)
UNiON
select * from test2 where gr not in ('1','2','3');
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2133228.64..2150256.12 rows=2270330 width=8) (actual 
time=19347.622..20610.326 rows=2267855 loops=1)
   ->  Sort  (cost=2133228.64..2138904.47 rows=2270330 width=8) (actual 
time=19347.620..19854.822 rows=2267855 loops=1)
         Sort Key: foo.id, foo.gr
         Sort Method: external merge  Disk: 39976kB
         ->  Append  (cost=320764.65..1831461.56 rows=2270330 width=8) (actual 
time=3527.894..18066.441 rows=2267855 loops=1)
               ->  Subquery Scan on foo  (cost=320764.65..322014.65 rows=100000 
width=8) (actual time=3527.894..3593.640 rows=100000 loops=1)
                     ->  Limit  (cost=320764.65..321014.65 rows=100000 
width=16) (actual time=3527.891..3573.712 rows=100000 loops=1)
                           ->  Sort  (cost=320764.65..324947.97 rows=1673329 
width=16) (actual time=3527.890..3561.467 rows=100000 loops=1)
                                 Sort Key: (random())
                                 Sort Method: external merge  Disk: 42464kB
                                 ->  Seq Scan on test2  (cost=0.00..173431.04 
rows=1673329 width=16) (actual time=0.031..1786.680 rows=1667258 loops=1)
                                       Filter: (gr = 1)
                                       Rows Removed by Filter: 8332742
               ->  Subquery Scan on "*SELECT* 2"  (cost=652508.36..655008.36 
rows=200000 width=8) (actual time=5818.155..5947.836 rows=200000 loops=1)
                     ->  Limit  (cost=652508.36..653008.36 rows=200000 
width=16) (actual time=5818.152..5907.384 rows=200000 loops=1)
                           ->  Sort  (cost=652508.36..660839.18 rows=3332326 
width=16) (actual time=5818.150..5882.304 rows=200000 loops=1)
                                 Sort Key: (random())
                                 Sort Method: external merge  Disk: 84848kB
                                 ->  Seq Scan on test2 test2_1  
(cost=0.00..177578.53 rows=3332326 width=16) (actual time=0.025..2010.535 
rows=3332151 loops=1)
                                       Filter: (gr = 2)
                                       Rows Removed by Filter: 6667849
               ->  Subquery Scan on "*SELECT* 3"  (cost=651237.57..654987.57 
rows=300000 width=8) (actual time=5798.621..5985.493 rows=300000 loops=1)
                     ->  Limit  (cost=651237.57..651987.57 rows=300000 
width=16) (actual time=5798.618..5927.191 rows=300000 loops=1)
                           ->  Sort  (cost=651237.57..659547.55 rows=3323992 
width=16) (actual time=5798.617..5890.713 rows=300000 loops=1)
                                 Sort Key: (random())
                                 Sort Method: external merge  Disk: 84856kB
                                 ->  Seq Scan on test2 test2_2  
(cost=0.00..177557.69 rows=3323992 width=16) (actual time=0.021..1990.768 
rows=3332736 loops=1)
                                       Filter: (gr = 3)
                                       Rows Removed by Filter: 6667264
               ->  Seq Scan on test2 test2_3  (cost=0.00..181747.68 
rows=1670330 width=8) (actual time=0.028..2267.051 rows=1667855 loops=1)
                     Filter: (gr <> ALL ('{1,2,3}'::integer[]))
                     Rows Removed by Filter: 8332145
 Planning time: 0.344 ms
 Execution time: 20759.252 ms
(34 rows)

Time: 20760.563 ms (00:20.761)

{code}

Case with smaller subset selection: 13-14 seconds

{code}
explain analyze
select * from
(select * from test2 where gr = '1' order by random() limit 1000) as foo
UnION
(select * from test2 where gr= '2'  order by random()  limit 2000)
UNion
(select * from test2 where gr= '3'  order by random()  limit 3000)
UNiON
select * from test2 where gr not in ('1','2','3');

                                                                          QUERY 
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1423040.27..1435612.75 rows=1676330 width=8) (actual 
time=12549.205..13420.110 rows=1673855 loops=1)
   ->  Sort  (cost=1423040.27..1427231.10 rows=1676330 width=8) (actual 
time=12549.194..12844.827 rows=1673855 loops=1)
         Sort Key: foo.id, foo.gr
         Sort Method: external merge  Disk: 29496kB
         ->  Append  (cost=265177.86..1226812.44 rows=1676330 width=8) (actual 
time=2373.069..10965.185 rows=1673855 loops=1)
               ->  Subquery Scan on foo  (cost=265177.86..265190.36 rows=1000 
width=8) (actual time=2373.068..2373.567 rows=1000 loops=1)
                     ->  Limit  (cost=265177.86..265180.36 rows=1000 width=16) 
(actual time=2373.066..2373.368 rows=1000 loops=1)
                           ->  Sort  (cost=265177.86..269361.18 rows=1673329 
width=16) (actual time=2373.065..2373.223 rows=1000 loops=1)
                                 Sort Key: (random())
                                 Sort Method: top-N heapsort  Memory: 141kB
                                 ->  Seq Scan on test2  (cost=0.00..173431.04 
rows=1673329 width=16) (actual time=0.022..1884.731 rows=1667258 loops=1)
                                       Filter: (gr = 1)
                                       Rows Removed by Filter: 8332742
               ->  Subquery Scan on "*SELECT* 2"  (cost=376948.00..376973.00 
rows=2000 width=8) (actual time=2978.519..2979.438 rows=2000 loops=1)
                     ->  Limit  (cost=376948.00..376953.00 rows=2000 width=16) 
(actual time=2978.517..2979.056 rows=2000 loops=1)
                           ->  Sort  (cost=376948.00..385278.81 rows=3332326 
width=16) (actual time=2978.516..2978.830 rows=2000 loops=1)
                                 Sort Key: (random())
                                 Sort Method: top-N heapsort  Memory: 280kB
                                 ->  Seq Scan on test2 test2_1  
(cost=0.00..177578.53 rows=3332326 width=16) (actual time=0.024..2034.872 
rows=3332151 loops=1)
                                       Filter: (gr = 2)
                                       Rows Removed by Filter: 6667849
               ->  Subquery Scan on "*SELECT* 3"  (cost=386150.60..386188.10 
rows=3000 width=8) (actual time=3098.020..3099.406 rows=3000 loops=1)
                     ->  Limit  (cost=386150.60..386158.10 rows=3000 width=16) 
(actual time=3098.018..3098.834 rows=3000 loops=1)
                           ->  Sort  (cost=386150.60..394460.58 rows=3323992 
width=16) (actual time=3098.017..3098.491 rows=3000 loops=1)
                                 Sort Key: (random())
                                 Sort Method: top-N heapsort  Memory: 468kB
                                 ->  Seq Scan on test2 test2_2  
(cost=0.00..177557.69 rows=3323992 width=16) (actual time=0.023..2117.887 
rows=3332736 loops=1)
                                       Filter: (gr = 3)
                                       Rows Removed by Filter: 6667264
               ->  Seq Scan on test2 test2_3  (cost=0.00..181747.68 
rows=1670330 width=8) (actual time=0.021..2314.743 rows=1667855 loops=1)
                     Filter: (gr <> ALL ('{1,2,3}'::integer[]))
                     Rows Removed by Filter: 8332145
 Planning time: 0.295 ms
 Execution time: 13525.515 ms
(34 rows)

{code}

> Balance datasets
> ----------------
>
>                 Key: MADLIB-1168
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1168
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module: Sampling
>            Reporter: Frank McQuillan
>            Assignee: ssoni
>             Fix For: v1.14
>
>         Attachments: MADlib Balance Datasets Requirements.pdf, 
> MADlib_Balance_Datasets_Requirements_v2.pdf
>
>
> From [1] here is the motivation behind balancing datasets:
> “Most classification algorithms will only perform optimally when the number 
> of samples of each class is roughly the same. Highly skewed datasets, where 
> the minority is heavily outnumbered by one or more classes, have proven to be 
> a challenge while at the same time becoming more and more common.
> One way of addressing this issue is by re-sampling the dataset as to offset 
> this imbalance with the hope of arriving at a more robust and fair decision 
> boundary than you would otherwise.
> Re-sampling techniques can be divided in these categories:
> * Under-sampling the majority class(es).
> * Over-sampling the minority class.
> * Combining over- and under-sampling.
> * Create ensemble balanced sets.”
> There is an extensive literature on balancing datasets.  The plan for MADlib 
> in the initial phase is to offer basic functionality that can be extended in 
> later phases based on feedback from users.  
> Please see attached document for proposed scope of this story.
> References
> [1] imbalance-learn Python project
> http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
> https://github.com/scikit-learn-contrib/imbalanced-learn



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to