Github user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
@scwf Even if it works in the end, I don't think it performs better.
Let's do a simple calculation. Assume the limit number is `n`, the partition
number is `N`, and each partition has `n / r` rows on average.
For this change, in the worst case, suppose the first scanned partition
returns 0 rows; we then quadruple the number of partitions to scan, and each
of those partitions returns `n / r` rows. So in the end we scan
`4 * n / r + n = n * (4 + r) / r` rows in total.
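
To make that concrete, here is a minimal Scala sketch of the incremental scan
(the 4x growth matches the quadrupling described above; the function name and
the sample numbers are mine for illustration, not code from this PR):

```scala
// A minimal model of the incremental limit scan: scan a growing batch of
// partitions, quadrupling the batch size each round, until `limit` rows
// have been collected. Returns the total number of rows scanned.
def rowsScannedIncrementally(limit: Long, rowsPerPartition: Seq[Long]): Long = {
  var collected = 0L
  var scanned = 0L
  var partsToScan = 1
  var next = 0
  while (collected < limit && next < rowsPerPartition.length) {
    val batchRows = rowsPerPartition.slice(next, next + partsToScan).sum
    scanned += batchRows
    collected += batchRows
    next += partsToScan
    partsToScan *= 4 // quadruple the partitions scanned in the next round
  }
  scanned
}

// Worst case above with n = 1000, r = 2: the first partition is empty, the
// quadrupled batch holds n / r = 500 rows each, so the partition-side scan
// reads 4 * n / r = 2000 rows.
println(rowsScannedIncrementally(1000L, Seq(0L, 500L, 500L, 500L, 500L)))
```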
Now suppose instead that you know how many elements to retrieve from each
partition back to a single partition for the global limit operation. You then
need to produce all rows in all partitions, i.e. `N * n / r` rows, plus
shuffle `n` rows to the single partition.
Ignoring the shuffling cost and comparing `n * (4 + r) / r` with `N * n / r`,
your solution scans fewer rows only if `N < 4 + r`. For example, if each
partition has `n / 2` rows, `N` must be less than `6`. So your solution only
performs better when the partition number is relatively small.
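
To see where the breakeven sits, here is a quick Scala check that just plugs
sample numbers into the two closed-form costs above (the helper names and the
values of `n`, `r`, and `N` are illustrative assumptions):

```scala
// Worst-case rows scanned by the incremental approach: 4 * n / r + n.
def incrementalScanRows(n: Double, r: Double): Double =
  n * (4 + r) / r

// Rows produced when all N partitions are scanned so the per-partition
// counts are known; shuffle cost ignored, as in the comparison above.
def fullScanRows(n: Double, r: Double, numPartitions: Int): Double =
  numPartitions * n / r

val (n, r) = (1000.0, 2.0) // each partition holds n / 2 = 500 rows
for (numPartitions <- Seq(4, 6, 8)) {
  val fewer = fullScanRows(n, r, numPartitions) < incrementalScanRows(n, r)
  println(s"N = $numPartitions: scans fewer rows? $fewer (breakeven: N < 4 + r = ${4 + r})")
}
```

With `r = 2` this prints `true` only for `N = 4`, matching the `N < 6` bound
above.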