GitHub user viirya commented on the issue:
https://github.com/apache/spark/pull/16633
That case only happens when the row counts in all partitions are less than
or (nearly) equal to the limit number, so the query needs to scan (almost)
all partitions.
One possible way to deal with this case is to use row count statistics to
decide whether to run this global limit without a shuffle or to fall back
to the old global limit.
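
A minimal sketch of what such a statistics-based decision could look like,
written in plain Python rather than against Spark's planner. Everything
here is an assumption for illustration: the function name
`choose_global_limit`, the `slack` factor, the use of a single estimated
total row count, and the fallback when no statistics are available. It is
not what the PR implements.

```python
from typing import Optional


def choose_global_limit(estimated_rows: Optional[int],
                        limit: int,
                        slack: float = 1.2) -> str:
    """Pick a limit strategy from an estimated row count (hypothetical helper).

    Returns "shuffle" for the old global limit, or "no_shuffle" for the
    global limit that avoids shuffling data to a single partition.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")

    # Assumption: without statistics, stay with the old behaviour.
    if estimated_rows is None:
        return "shuffle"

    # If the input has at most roughly `limit` rows, the no-shuffle plan
    # would have to scan (almost) all partitions, so use the old limit.
    if estimated_rows <= limit * slack:
        return "shuffle"

    # Plenty of rows relative to the limit: the no-shuffle plan is expected
    # to stop after reading only some of the partitions.
    return "no_shuffle"
```

For example, `choose_global_limit(1_000_000, 10)` picks the no-shuffle
plan, while `choose_global_limit(1000, 1000)` falls back to the old one.
The threshold would need tuning against real workloads.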