xudong963 commented on code in PR #18868:
URL: https://github.com/apache/datafusion/pull/18868#discussion_r2670997727
##########
datafusion/physical-optimizer/src/limit_pushdown.rs:
##########
@@ -337,4 +362,37 @@ fn add_global_limit(
Arc::new(GlobalLimitExec::new(pushdown_plan, skip, fetch)) as _
}
+/// Helper function to handle DataSourceExec preserve_order setting
+fn ensure_preserve_order_if_needed(
+ plan: Arc<dyn ExecutionPlan>,
+ order_sensitive: bool,
+) -> Arc<dyn ExecutionPlan> {
+ if !order_sensitive {
+ return plan;
+ }
+
+ let Some(data_source_exec) =
plan.as_any().downcast_ref::<DataSourceExec>() else {
+ return plan;
+ };
Review Comment:
We could set the preserve_order flag to true when an order was pushed down
into FileScanConfig.
But here, I'm trying to solve the case described in
https://github.com/apache/datafusion/pull/18868#issuecomment-3669612233
> However, a complication arises: if the data distribution and physical
ordering of table t already satisfy ORDER BY a, the EnforceSorting phase in the
Physical Optimizer will remove the Sort node. Subsequently, the LIMIT is pushed
down to the DataSource during the LimitPushdown phase.
> In this scenario, we cannot rely solely on the presence of the LIMIT at
the Parquet level to decide whether to prune. If we prune based on a limit that
was originally associated with a removed Sort, we might violate the required
global ordering.
> Proposed Solution: To address this fundamentally, when a Sort node is
removed because the distribution already matches the requirements, we should
mark the resulting Limit node as
["Order-Sensitive."](https://github.com/apache/datafusion/pull/18868/commits/46cf80b8b4ce49e96548b873369c66835629fbb2#diff-445a93b16e352019a131011d0574c5bc2e1ec32b830c0052876bdba76855e5f5R586-R593)
During the subsequent LimitPushdown, we can [detect this flag and set
](https://github.com/apache/datafusion/pull/18868/commits/46cf80b8b4ce49e96548b873369c66835629fbb2#diff-baa20d781df314d37cdfd47e897af90ae4aabf1aa2b892a90bf0b3bdd21e020eR253-R279)the
preserve_order attribute to true in the ParquetOpener. This ensures that Limit
Pruning is bypassed when the global order must be maintained.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]