GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/22630

    [SPARK-25497][SQL] Limit operation within whole stage codegen should not 
consume all the inputs

    ## What changes were proposed in this pull request?
    
    This PR is inspired by https://github.com/apache/spark/pull/22524, but 
picks a more aggressive fix.
    
    The current limit handling in whole-stage codegen has two problems:
    1. It is only applied to `InputAdapter`; many leaf nodes cannot stop early 
once the limit is reached.
    2. It relies on overriding a method, which breaks if there is more than one 
limit in the same whole-stage plan.
    
    The first problem is easy to fix: identify which nodes can stop early once 
the limit is reached, and update them. This PR updates `RangeExec`, 
`ColumnarBatchScan`, `SortExec`, `HashAggregateExec` and `SortMergeJoinExec`.
    
    The second problem is harder to fix. This PR proposes to propagate the limit 
counter's variable name upstream, so that the upstream leaf/blocking nodes can 
check the limit counter and exit the loop early.
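    The idea can be sketched with a simplified, hypothetical model (plain Java, 
not actual Spark generated code): the limit operator owns a counter, and an 
upstream producer loop checks that counter so it can stop instead of consuming 
all of its input. The class and method names below are illustrative only.

```java
public class LimitStopCheckSketch {
    // The shared counter the limit operator would own; in the real codegen it
    // is this counter's *variable name* that gets propagated upstream.
    final int limit;
    int count = 0;

    LimitStopCheckSketch(int limit) {
        this.limit = limit;
    }

    // Conceptually what an upstream producer loop (e.g. RangeExec) does once
    // it knows the counter's name: bail out as soon as the limit is hit.
    long produceRange(long start, long end) {
        long produced = 0;
        for (long i = start; i < end && count < limit; i++) {
            count++;      // the downstream limit consumes the row
            produced++;
        }
        return produced;  // exits after `limit` rows, not (end - start) rows
    }

    public static void main(String[] args) {
        LimitStopCheckSketch sketch = new LimitStopCheckSketch(5);
        // Scans 5 rows out of 1,000,000 instead of consuming all the input.
        System.out.println(sketch.produceRange(0L, 1_000_000L));
    }
}
```

    Because the counter is shared, a second producer in the same stage would 
see it already exhausted and produce nothing, which is exactly the early-exit 
behavior the PR is after.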
    
    For better performance, the implementation here follows 
`CodegenSupport.needStopCheck`, so that we only generate the check if there is 
a limit in the query. For a columnar node like range, we check the limit 
counter per batch instead of per row, to keep the inner loop tight and fast.
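    As a hypothetical sketch of the per-batch check (illustrative Java, not the 
real generated code): the counter is consulted only at batch boundaries, so the 
inner per-row loop carries no extra branch. A batch may slightly overrun the 
limit, and the downstream limit operator still truncates the final output.

```java
import java.util.ArrayList;
import java.util.List;

public class PerBatchCheckSketch {
    static List<Integer> consumeBatches(List<int[]> batches, int limit) {
        List<Integer> out = new ArrayList<>();
        int count = 0;
        for (int[] batch : batches) {
            if (count >= limit) {
                break;               // limit check once per batch boundary
            }
            for (int value : batch) { // tight inner loop, no limit check
                out.add(value);
                count++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> batches = List.of(
            new int[]{1, 2, 3}, new int[]{4, 5, 6}, new int[]{7, 8, 9});
        // With limit 4 the second batch is still fully processed (6 rows),
        // but the third batch is skipped entirely.
        System.out.println(consumeBatches(batches, 4).size());
    }
}
```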
    
    ## How was this patch tested?
    
    A new test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark limit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22630.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22630
    
----
commit d9b54d5c6edd4f5337efb2d185dbb58f33972616
Author: Wenchen Fan <wenchen@...>
Date:   2018-10-03T00:00:54Z

    Limit operation within whole stage codegen should not consume all the inputs

----


---
