Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/15596
  
    For a bit of background on `CollectLimitExec`, note that the intent of this 
operator is to optimize the special case where a limit is the terminal operator 
of a plan: in that case, we don't need to perform a shuffle because the driver 
can run multiple jobs that scan an increasingly large portion of the RDD to 
collect the limited items. In a nutshell, the goal is to allow logic similar to 
the RDD `take()` action, which stops early without having to compute all 
partitions of the limited RDD.
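
    To make the take()-style scan concrete, here is a minimal, self-contained 
sketch of the idea (this is not Spark's actual implementation; `partitions` 
stands in for the limited RDD's partitions and each loop iteration stands in 
for one driver-launched job):

```scala
object IncrementalTakeSketch {
  def take[T](partitions: Seq[Seq[T]], limit: Int): Seq[T] = {
    val results = scala.collection.mutable.ArrayBuffer.empty[T]
    var partsScanned = 0
    var numPartsToTry = 1
    while (results.size < limit && partsScanned < partitions.size) {
      // One "job": scan only the next batch of partitions.
      val scanned = partitions
        .slice(partsScanned, partsScanned + numPartsToTry)
        .flatMap(_.take(limit - results.size))
      results ++= scanned.take(limit - results.size)
      partsScanned += numPartsToTry
      // Grow the batch size each round (as RDD.take does) so we stop early
      // when the first few partitions already satisfy the limit.
      numPartsToTry *= 4
    }
    results.toSeq
  }
}

// e.g. take(Seq(Seq(1, 2, 3, 4, 5), Seq(6, 7), Seq(8, 9)), 4) returns
// Seq(1, 2, 3, 4) after a single driver-side "job", never scanning the
// remaining partitions.
```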
    
    In typical operation we don't expect `CollectLimitExec` to appear in the 
middle of a query plan, so `CollectLimitExec.execute()` should generally only 
be called in special cases, such as calling `.rdd` on a limited Dataset and 
then performing further operations on it. This is why I didn't use 
`EnsureRequirements` here: if we did, we'd end up shuffling all of the limited 
partitions to a single non-driver partition, limiting that, and then collecting 
to the driver, which would degrade performance in the case where the limit is 
the terminal operation in the query plan.
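
    To illustrate that special case, here is a hedged sketch (assuming a 
standard local `SparkSession`; the numbers are made up): a plain `collect()` on 
the limited Dataset stays on the take()-style path, while converting to an RDD 
and continuing the computation is the kind of mid-plan use where 
`CollectLimitExec.execute()` gets called.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("collect-limit-example")
  .master("local[*]")
  .getOrCreate()

// Terminal limit: collect() can use the incremental, take()-style scan
// driven from the driver, with no shuffle.
val limited = spark.range(0L, 1000000L).limit(10)
limited.collect()

// Limit in the middle of the computation: converting to an RDD and doing
// further work afterwards is the kind of case where execute() runs instead.
val downstream = limited.rdd.map(x => x.longValue + 1L)
println(downstream.count())
```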

