xudong963 commented on PR #18868:
URL: https://github.com/apache/datafusion/pull/18868#issuecomment-3669612233

   Hey @alamb, I rethought the issue:
   
   The key is that we need to accurately identify _pure_ LIMIT queries (those 
without an ORDER BY clause) to safely apply Limit Pruning.
   
   Generally, if an `ORDER BY` exists, the `LIMIT` cannot be pushed down to the 
Parquet level. Consequently, we often determine whether to execute Limit 
Pruning _based on whether a limit is set within the ParquetOpener_.
   
   `SELECT * FROM t WHERE b > 10 ORDER BY a LIMIT 10` ❌ (Pruning disabled)
   
   `SELECT * FROM t WHERE b > 10 LIMIT 10` ✅ (Pruning enabled)
   
   However, a **complication arises**: if the physical ordering of table `t` 
already satisfies `ORDER BY a`, the `EnforceSorting` phase in the Physical 
Optimizer will _remove_ the Sort node. The `LIMIT` is then pushed down to the 
`DataSource` during the `LimitPushdown` phase, so the scan sees a limit even 
though the query is not a pure LIMIT query.
   
   In this scenario, we cannot rely _solely_ on the presence of the `LIMIT` at 
the Parquet level to decide whether to prune. If we prune based on a limit that 
was originally associated with a removed Sort, we might _violate the required 
global ordering_.
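
   To make the risk concrete, here is a minimal sketch in Rust. It assumes 
that Limit Pruning works by preferring row groups whose statistics prove that 
every row matches the predicate and skipping the rest once those groups can 
satisfy the limit. `RowGroup`, `scan_in_order`, and `scan_with_limit_pruning` 
are hypothetical names for illustration, not DataFusion APIs.

```rust
/// Hypothetical summary of one Parquet row group for the predicate `b > 10`.
#[derive(Debug, Clone)]
struct RowGroup {
    /// Values of column `a` for the rows that match the predicate, in file order.
    matching_a: Vec<i64>,
    /// True when statistics prove that every row in the group matches.
    fully_matched: bool,
}

/// Read row groups in file order and stop after `limit` matching rows.
fn scan_in_order(groups: &[RowGroup], limit: usize) -> Vec<i64> {
    groups
        .iter()
        .flat_map(|g| g.matching_a.iter().copied())
        .take(limit)
        .collect()
}

/// Assumed pruning rule: if the fully matched groups alone can satisfy the
/// limit, skip every other group; otherwise fall back to the in-order scan.
fn scan_with_limit_pruning(groups: &[RowGroup], limit: usize) -> Vec<i64> {
    let from_fully_matched: Vec<i64> = groups
        .iter()
        .filter(|g| g.fully_matched)
        .flat_map(|g| g.matching_a.iter().copied())
        .collect();
    if from_fully_matched.len() >= limit {
        from_fully_matched.into_iter().take(limit).collect()
    } else {
        scan_in_order(groups, limit)
    }
}

/// A file sorted by `a`: the first group only partially matches `b > 10`,
/// the second group matches completely.
fn example_groups() -> Vec<RowGroup> {
    vec![
        RowGroup { matching_a: vec![1, 3], fully_matched: false },
        RowGroup { matching_a: vec![5, 6, 7, 8], fully_matched: true },
    ]
}
```

   With `LIMIT 3`, `scan_in_order(&example_groups(), 3)` returns `[1, 3, 5]`, 
which is the correct answer for `ORDER BY a LIMIT 3`. 
`scan_with_limit_pruning(&example_groups(), 3)` returns `[5, 6, 7]`: a valid 
result for a pure LIMIT query, but wrong once the ordering is required.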
   
   **Proposed Solution**: To address this _fundamentally_, when a Sort node is 
removed because the input ordering already satisfies the requirement, we should 
mark the resulting Limit node as "**Order-Sensitive**." During the subsequent 
`LimitPushdown`, we can detect this flag and set the `preserve_order` attribute 
to true in the ParquetOpener. This ensures that Limit Pruning is bypassed 
whenever the global order must be maintained.
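
   The flow could be sketched roughly as follows. This is a toy model in Rust, 
not DataFusion code: the `Plan` enum, the `order_sensitive` field, and the 
functions below are all hypothetical stand-ins for the real plan nodes and 
optimizer rules. Only the `preserve_order` name comes from the proposal above.

```rust
/// Toy physical plan. Every type and field here is hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { sorted_by: Option<String>, limit: Option<usize>, preserve_order: bool },
    Sort { key: String, input: Box<Plan> },
    Limit { fetch: usize, order_sensitive: bool, input: Box<Plan> },
}

/// Stand-in for `EnforceSorting`: drop a Sort directly under a Limit when the
/// scan is already sorted on the same key, and mark the Limit order-sensitive.
fn enforce_sorting(plan: Plan) -> Plan {
    match plan {
        Plan::Limit { fetch, order_sensitive, input } => match *input {
            Plan::Sort { key, input: child } => {
                let already_sorted =
                    matches!(&*child, Plan::Scan { sorted_by: Some(k), .. } if *k == key);
                if already_sorted {
                    // The Sort disappears, but the ordering requirement does not.
                    Plan::Limit { fetch, order_sensitive: true, input: child }
                } else {
                    let sort = Plan::Sort { key, input: child };
                    Plan::Limit { fetch, order_sensitive, input: Box::new(sort) }
                }
            }
            other => Plan::Limit { fetch, order_sensitive, input: Box::new(other) },
        },
        other => other,
    }
}

/// Stand-in for `LimitPushdown`: push a Limit that sits directly on a Scan
/// into the Scan and carry the flag along as `preserve_order`.
fn limit_pushdown(plan: Plan) -> Plan {
    match plan {
        Plan::Limit { fetch, order_sensitive, input } => match *input {
            Plan::Scan { sorted_by, .. } => Plan::Scan {
                sorted_by,
                limit: Some(fetch),
                preserve_order: order_sensitive,
            },
            other => Plan::Limit { fetch, order_sensitive, input: Box::new(other) },
        },
        other => other,
    }
}

/// Limit Pruning is allowed only for a scan that has a limit and no ordering
/// requirement attached to it.
fn limit_pruning_allowed(plan: &Plan) -> bool {
    matches!(plan, Plan::Scan { limit: Some(_), preserve_order: false, .. })
}

/// Hypothetical helper: an unlimited scan of `t`, optionally sorted on a column.
fn scan(sorted_by: Option<&str>) -> Plan {
    Plan::Scan { sorted_by: sorted_by.map(str::to_string), limit: None, preserve_order: false }
}

/// Run the two toy rules in the order described above.
fn optimize(plan: Plan) -> Plan {
    limit_pushdown(enforce_sorting(plan))
}
```

   In this model, a pure `LIMIT 10` over `scan(None)` ends up as a scan with 
`preserve_order: false`, so pruning stays enabled. `ORDER BY a LIMIT 10` over 
`scan(Some("a"))` also ends up as a scan with a limit, but with 
`preserve_order: true`, so pruning is bypassed. When the table is not sorted on 
`a`, the Sort stays in place and the limit never reaches the scan.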


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

