adriangb commented on PR #21956:
URL: https://github.com/apache/datafusion/pull/21956#issuecomment-4455515011

   I wonder if this belongs on `try_pushdown_sort` rather than on 
`DynamicFilterPhysicalExpr`. The `PushdownSort` rule already visits `SortExec → 
DataSourceExec` (TopK or not) and forwards the required ordering to the source 
via `try_pushdown_sort(order, eq_properties)`. Today parquet's implementation 
only accepts when the request matches the file's declared natural ordering (or 
its reverse). The case this PR is targeting — request doesn't match declared 
ordering (or the table declares none), but the source can still reorder row 
groups/files by min/max stats — fits the existing `Inexact` contract: keep the 
`SortExec`, but feed it data in a better order. The rule already preserves 
`sort_exec.fetch()` on the rebuilt `SortExec`, so TopK keeps working.
   
   Concretely: in `ParquetSource::try_pushdown_sort`, after the existing 
`Exact`/reverse-`Inexact` checks fall through, instead of returning 
`Unsupported`, check whether the requested sort column exists in the file 
schema (the same check the opener fallback is already doing) and return 
`Inexact` with `sort_order_for_reorder` set. The opener already prefers 
`sort_order_for_reorder` over the dynamic filter — that whole branch in 
`opener.rs` (the `find_dynamic_filter` / `find_column_in_expr` fallback) can go 
away, along with the `sort_options`/`fetch` fields on 
`DynamicFilterPhysicalExpr` and the reordering of `with_fetch` to call 
`create_filter` after setting fetch.
   
   A few reasons I think this factoring is cleaner:
   
   - `DynamicFilterPhysicalExpr`'s job is "live threshold for runtime pruning." 
Adding sort-direction and K fields to it conflates that with "how should the 
source schedule its reads." `try_pushdown_sort` is the channel that already 
means the latter.
   - The reorder is useful even without a `LIMIT` — less sorting work and 
better locality help any `ORDER BY`. Routing through the dynamic filter ties 
the optimization to TopK; routing through sort pushdown makes it apply to any 
`SortExec → DataSource` shape.
   - The opener never actually reads `df.fetch()` — only `sort_options` and the 
child column — so the only piece of metadata being moved is one the 
sort-pushdown call already passes as its `order` argument.
   - Fewer places to maintain: no proto-roundtrip carve-out, no 
`find_dynamic_filter` tree walk under `AND`/wrappers (which is the kind of 
thing that quietly breaks the next time someone wraps the predicate), and no 
ordering coupling between `with_fetch` and `create_filter`.
   
   Happy to be wrong here — is there a case I'm missing where the sort metadata 
can reach the parquet source through the dynamic filter but *not* through 
`try_pushdown_sort`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to