GitHub user viirya reopened a pull request:

    https://github.com/apache/spark/pull/14847

    [SPARK-17254][SQL] Filter can stop when the condition is false if the child output is sorted

    ## What changes were proposed in this pull request?
    
    From https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
    
    Filter on sorted data
    
    If the data is sorted by a key, filters on the key can stop as soon as the data is out of range.
    For example, WHERE ticker_id < “F” should stop as soon as the first row starting with “F” is seen.
    This can be done by adding a Filter operator that has “stop if false” semantics. This is generally useful.
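    
    As a rough illustration of the “stop if false” idea (plain Scala collections, not the actual Spark operator), a filter over sorted input can behave like `takeWhile` instead of `filter`:
    
        // Hypothetical sketch: sorted ticker ids, predicate ticker_id < "F".
        val sorted = Seq("AAPL", "AMZN", "FB", "GOOG", "MSFT")
        
        // A regular filter scans every row, even after the predicate turns false.
        val allScanned = sorted.filter(_ < "F")      // visits all 5 rows
        
        // A stop-if-false filter stops at the first row >= "F" on sorted input.
        val stopEarly  = sorted.takeWhile(_ < "F")   // visits only the first 3 rows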
    
    ### Benchmark
    
        val N = 500L << 18
        val df = sparkSession.range(N).sort("id").persist()
        runBenchmark("range/sort/filter/sum", N) {
          df.filter("id < 10").groupBy().sum().collect()
        }
    
    Before this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        range/sort/filter/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        range/sort/filter/sum wholestage off           573 /  665        228.6           4.4       1.0X
        range/sort/filter/sum wholestage on            273 /  363        479.6           2.1       2.1X
    
    After this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        range/sort/filter/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        range/sort/filter/sum wholestage off           325 /  424        403.7           2.5       1.0X
        range/sort/filter/sum wholestage on            254 /  289        516.2           1.9       1.3X
    
    This patch improves the wholestage off case significantly. For the wholestage on case there appears
    to be no significant difference, likely because in that case we use `continue` to skip the filtering,
    so the generated loop still has to iterate over all the rows.
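    
    A simplified sketch of that difference (plain Scala, not the actual generated code; variable names are made up for illustration):
    
        val rows = (0L until 1000000L).toArray              // sorted input (much smaller than the benchmark)
        
        // "continue"-style skipping: rows failing the predicate are skipped,
        // but the loop still visits every remaining row.
        var sumSkip = 0L
        var i = 0
        while (i < rows.length) {
          if (rows(i) < 10) sumSkip += rows(i)              // else: skip and keep scanning
          i += 1
        }
        
        // Stop-if-false: once the predicate is false on sorted input,
        // leave the loop and never touch the remaining rows.
        var sumStop = 0L
        var j = 0
        var done = false
        while (j < rows.length && !done) {
          if (rows(j) < 10) sumStop += rows(j) else done = true
          j += 1
        }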
    
    ## How was this patch tested?
    
    Jenkins tests and manual testing.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 filter-stop-early

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14847
    
----
commit ccac04f1e788dfb278618a10ee9220c89df6a61d
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date:   2016-08-26T12:49:54Z

    Filter can stop when the condition is false if the child output is sorted.

----

