Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf Even if it works in the end, I don't think it performs better.
Let's do a simple calculation. Assume the limit number is `n`, the partition
number is `N`, and each partition has `n / r` rows on average.
For this change, in the worst case, suppose the first scanned partition
returns 0 rows; we then quadruple the number of partitions to scan, and each
of those partitions returns `n / r` rows. So in the end we scan
`4 * n / r + n = n * (4 + r) / r` rows in total.
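
To make that concrete, here is a minimal Scala sketch of the incremental scan
(the 4x growth matches the quadrupling described above; the function name and
the sample numbers are mine for illustration, not code from this PR):

```scala
// A minimal model of the incremental limit scan: scan a growing batch of
// partitions, quadrupling the batch size each round, until `limit` rows
// have been collected. Returns the total number of rows scanned.
def rowsScannedIncrementally(limit: Long, rowsPerPartition: Seq[Long]): Long = {
  var collected = 0L
  var scanned = 0L
  var partsToScan = 1
  var next = 0
  while (collected < limit && next < rowsPerPartition.length) {
    val batchRows = rowsPerPartition.slice(next, next + partsToScan).sum
    scanned += batchRows
    collected += batchRows
    next += partsToScan
    partsToScan *= 4 // quadruple the partitions scanned in the next round
  }
  scanned
}

// Worst case above with n = 1000, r = 2: the first partition is empty, the
// quadrupled batch holds n / r = 500 rows each, so the partition-side scan
// reads 4 * n / r = 2000 rows.
println(rowsScannedIncrementally(1000L, Seq(0L, 500L, 500L, 500L, 500L)))
```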
Now suppose instead that you know how many elements to retrieve from each
partition back to a single partition for the global limit operation. You then
need to produce all rows in all partitions, i.e. `N * n / r` rows, plus
shuffle `n` rows to the single partition.
Ignoring the shuffling cost and comparing `n * (4 + r) / r` with `N * n / r`,
your solution scans fewer rows only if `N < 4 + r`. For example, if each
partition has `n / 2` rows, `N` must be less than `6`. So your solution only
performs better when the partition number is relatively small.
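
To see where the breakeven sits, here is a quick Scala check that just plugs
sample numbers into the two closed-form costs above (the helper names and the
values of `n`, `r`, and `N` are illustrative assumptions):

```scala
// Worst-case rows scanned by the incremental approach: 4 * n / r + n.
def incrementalScanRows(n: Double, r: Double): Double =
  n * (4 + r) / r

// Rows produced when all N partitions are scanned so the per-partition
// counts are known; shuffle cost ignored, as in the comparison above.
def fullScanRows(n: Double, r: Double, numPartitions: Int): Double =
  numPartitions * n / r

val (n, r) = (1000.0, 2.0) // each partition holds n / 2 = 500 rows
for (numPartitions <- Seq(4, 6, 8)) {
  val fewer = fullScanRows(n, r, numPartitions) < incrementalScanRows(n, r)
  println(s"N = $numPartitions: scans fewer rows? $fewer (breakeven: N < 4 + r = ${4 + r})")
}
```

With `r = 2` this prints `true` only for `N = 4`, matching the `N < 6` bound
above.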