GitHub user viirya reopened a pull request:

    https://github.com/apache/spark/pull/14847

    [SPARK-17254][SQL] Filter can stop when the condition is false if the child output is sorted

    ## What changes were proposed in this pull request?
    
    From https://issues.apache.org/jira/secure/attachment/12778890/BucketedTables.pdf:
    
    Filter on sorted data
    
    If the data is sorted by a key, filters on the key can stop as soon as the data is out of range.
    For example, WHERE ticker_id < “F” should stop as soon as the first row starting with “F” is seen.
    This can be done by adding a Filter operator that has “stop if false” semantics. This is generally useful.
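    
    As a rough illustration of the “stop if false” idea (plain Scala collections, not the actual Spark operator), a filter over sorted input can behave like `takeWhile` instead of `filter`:
    
        // Hypothetical sketch: sorted ticker ids, predicate ticker_id < "F".
        val sorted = Seq("AAPL", "AMZN", "FB", "GOOG", "MSFT")
        
        // A regular filter scans every row, even after the predicate turns false.
        val allScanned = sorted.filter(_ < "F")      // visits all 5 rows
        
        // A stop-if-false filter stops at the first row >= "F" on sorted input.
        val stopEarly  = sorted.takeWhile(_ < "F")   // visits only the first 3 rows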
    
    ### Benchmark
    
        val N = 500L << 18
        val df = sparkSession.range(N).sort("id").persist()
        runBenchmark("range/sort/filter/sum", N) {
          df.filter("id < 10").groupBy().sum().collect()
        }
    
    Before this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        range/sort/filter/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        range/sort/filter/sum wholestage off           573 /  665        228.6           4.4       1.0X
        range/sort/filter/sum wholestage on            273 /  363        479.6           2.1       2.1X
    
    After this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-64-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        range/sort/filter/sum:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        range/sort/filter/sum wholestage off           325 /  424        403.7           2.5       1.0X
        range/sort/filter/sum wholestage on            254 /  289        516.2           1.9       1.3X
    
    This patch improves the wholestage off case significantly. For the wholestage on case there appears
    to be no significant difference, likely because in that case we use `continue` to skip the filtering,
    so the generated loop still has to iterate over all the rows.
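    
    A simplified sketch of that difference (plain Scala, not the actual generated code; variable names are made up for illustration):
    
        val rows = (0L until 1000000L).toArray              // sorted input (much smaller than the benchmark)
        
        // "continue"-style skipping: rows failing the predicate are skipped,
        // but the loop still visits every remaining row.
        var sumSkip = 0L
        var i = 0
        while (i < rows.length) {
          if (rows(i) < 10) sumSkip += rows(i)              // else: skip and keep scanning
          i += 1
        }
        
        // Stop-if-false: once the predicate is false on sorted input,
        // leave the loop and never touch the remaining rows.
        var sumStop = 0L
        var j = 0
        var done = false
        while (j < rows.length && !done) {
          if (rows(j) < 10) sumStop += rows(j) else done = true
          j += 1
        }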
    
    ## How was this patch tested?
    
    Jenkins tests and manual testing.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 filter-stop-early

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14847.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14847
    
----
commit ccac04f1e788dfb278618a10ee9220c89df6a61d
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date:   2016-08-26T12:49:54Z

    Filter can stop when the condition is false if the child output is sorted.

----

