Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/15596
  
    > In typical operation, we don't necessarily expect CollectLimitExec to 
appear in the middle of a query plan, so CollectLimitExec.execute() should 
generally only be called in special cases such as calling .rdd() on a limited 
RDD then performing further operations on it. This is why I didn't use 
EnsureRequirements here: if we did, then we'd end up shuffling all limited 
partitions to a single non-driver partition, then limiting that and collecting 
to the driver, degrading performance in the case where a limit is the terminal 
operation in the query plan.
    
    Yeah, I think I understand the intent. So in `executeCollect`, I strip off 
the shuffle if needed.
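    For illustration, here is a minimal RDD-level sketch (not the PR's actual 
code; the `LimitSketch` object, the dataset, and the partition count are made 
up) contrasting a terminal limit collected directly on the driver with the 
shuffle-into-a-single-partition variant that going through `EnsureRequirements` 
would produce:

    ```scala
    import org.apache.spark.sql.SparkSession

    object LimitSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[4]")
          .appName("limit-sketch")
          .getOrCreate()

        val rdd = spark.sparkContext.parallelize(1 to 1000000, 8)
        val limit = 10

        // Terminal limit: each partition is limited locally, then the driver
        // takes the first `limit` rows directly -- no shuffle involved.
        val direct = rdd.mapPartitions(_.take(limit)).take(limit)

        // Shuffle-based variant: move all locally-limited rows into a single
        // non-driver partition, apply the global limit there, then collect.
        val shuffled = rdd.mapPartitions(_.take(limit))
          .repartition(1)
          .mapPartitions(_.take(limit))
          .collect()

        assert(direct.length == limit && shuffled.length == limit)
        spark.stop()
      }
    }
    ```

    The extra `repartition(1)` stage in the second variant is the overhead the 
quoted comment describes for the case where the limit is the terminal operation.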


