GitHub user viirya reopened a pull request:

    https://github.com/apache/spark/pull/14847

    [SPARK-17254][SQL] Add StopAfter physical plan for the filtering that can be stopped early

    ## What changes were proposed in this pull request?
    
    This is motivated by:
    
    From https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
    
    For sorted data, such as a bucketed table created with SORT BY, a filter can stop early if its condition satisfies certain requirements. For example, a filter such as "WHERE id < 10" can stop as soon as it sees the first row whose id value is greater than or equal to 10. That is, once the condition evaluates to false, the filter can stop.
    
    This PR adds a new `StopAfterExec` physical plan to achieve this.
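
    The early-stop idea can be illustrated without Spark. The following is a minimal sketch in plain Scala, not the `StopAfterExec` implementation from this PR: the object name `StopEarlySketch` and both helper functions are hypothetical, and the input is assumed to be sorted ascending on id. It contrasts a full scan with a scan that ends at the first row failing the predicate, and counts how many rows each one examines.

```scala
// Hypothetical illustration only; none of these names come from the PR.
object StopEarlySketch {

  /** Plain filter: evaluates the predicate on every row.
    * Returns the matching ids and the number of rows examined. */
  def filterAll(ids: Iterator[Long], bound: Long): (Seq[Long], Long) = {
    var examined = 0L
    val kept = ids.filter { id => examined += 1; id < bound }.toVector
    (kept, examined)
  }

  /** Early-stop filter: ends at the first row that fails the predicate.
    * Only correct when `ids` is sorted ascending, so that no later row
    * can satisfy `id < bound` once one row has failed it. */
  def filterStopEarly(ids: Iterator[Long], bound: Long): (Seq[Long], Long) = {
    var examined = 0L
    val kept = ids.takeWhile { id => examined += 1; id < bound }.toVector
    (kept, examined)
  }

  def main(args: Array[String]): Unit = {
    val n = 1000000L
    // A fresh sorted iterator for each run.
    def sortedIds: Iterator[Long] = (0L until n).iterator

    val (allKept, allExamined) = filterAll(sortedIds, 10L)
    val (earlyKept, earlyExamined) = filterStopEarly(sortedIds, 10L)

    println(s"full scan:  kept ${allKept.size} rows, examined $allExamined")
    println(s"early stop: kept ${earlyKept.size} rows, examined $earlyExamined")
  }
}
```

    Both functions return the same rows on sorted input; the difference is only in how much of the input is read. On unsorted input the early-stop version would silently drop matching rows, which is why the optimization has to be restricted to children whose output ordering is known.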
    
    Design document: https://issues.apache.org/jira/secure/attachment/12833704/stop-after-physical-plan.pdf
    
    ### Benchmark
    
        val N = 500L << 18
        val df = sparkSession.range(N).sort("id").persist()
        runBenchmark("range/sort/filter/sum", N) {
          df.filter("id < 10").groupBy().sum().collect()
        }
    
    Before this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.20-moby
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        range/sort/filter/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        range/sort/filter/sum wholestage off           266 /  389        493.4           2.0       1.0X
        range/sort/filter/sum wholestage on            170 /  208        771.9           1.3       1.6X
    
    After this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_102-b14 on Linux 4.4.20-moby
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        range/sort/filter/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        range/sort/filter/sum wholestage off           141 /  207        928.4           1.1       1.0X
        range/sort/filter/sum wholestage on             49 /   67       2694.1           0.4       2.9X
    
    ## How was this patch tested?
    
    Jenkins tests and manual testing.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 filter-stop-early

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14847
    
----
commit ccac04f1e788dfb278618a10ee9220c89df6a61d
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date:   2016-08-26T12:49:54Z

    Filter can stop when the condition is false if the child output is sorted.

commit fd365cde14a6a94c33418522d300dca43a94ad92
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date:   2016-08-26T12:49:54Z

    Add StopAfter physical plan.

commit 293474a6a9229630e4e31e51f07c7b8821228f41
Author: Liang-Chi Hsieh <vii...@gmail.com>
Date:   2016-10-14T05:36:00Z

    rename StopAfter to StopAfterExec.

----


