adriangb opened a new pull request, #22300:
URL: https://github.com/apache/datafusion/pull/22300

   ## Which issue does this PR close?
   
   - Related to #21996
   - Related to #21624
   
   This is **not** a replacement for #21996 — it is a minimal subset of it, 
carved out so the feature can be discussed/merged in smaller pieces.
   
   ## Rationale for this change
   
   #21996 ("Query-aware statistics requests via ScanArgs / ScanResult") is a 
full vertical slice: new statistics types, request threading from the optimizer 
through the planner to the provider, a built-in `RequestStatistics` optimizer 
rule, and a consumer integration (`FilePruner` / `ListingTable`).
   
   This PR extracts **only the framework hooks** — just enough that the rest 
can be implemented *entirely outside* of DataFusion. A third party can write 
their own optimizer rule to derive statistics requests, and their own 
`TableProvider` to consume them, without DataFusion shipping any rule or 
consumer of its own.
   
   In stock DataFusion nothing observable changes: no rule populates the new 
field, and the built-in providers ignore it.
   
   ## What changes are included in this PR?
   
   Five small, independently-reviewable commits:
   
   1. **`refactor: add TableScanBuilder, deprecate TableScan::try_new`** — 
`TableScan::try_new` takes five positional args and bare `TableScan { .. }` 
literals are fragile to field additions. Introduce `TableScanBuilder` (with 
`From<TableScan>`), move schema derivation into `build()`, deprecate `try_new` 
(delegates to the builder), migrate all in-tree callers. Pure refactor.
   2. **`feat: add StatisticsRequest / StatisticsValue / SatisfiedStatistics`** 
— new public vocabulary types in `datafusion-expr-common::statistics`. Nothing 
consumes them yet.
   3. **`feat: add TableScan::statistics_requests field`** — an advisory 
`Vec<StatisticsRequest>` on `TableScan`, settable via 
`TableScan::with_statistics_requests` / `TableScanBuilder`. Empty by default; 
DataFusion's own rules never populate it.
   4. **`feat: thread statistics requests into ScanArgs`** — `ScanArgs` gains 
`statistics_requests`; the physical planner threads 
`TableScan::statistics_requests` into it so the request reaches 
`TableProvider::scan_with_args`.
   5. **`test: e2e statistics-request flow via a custom optimizer rule`** — an 
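integration test that plays both external roles: the custom optimizer rule that 
annotates `TableScan` and the custom `TableProvider` that receives the requests.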
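
The flow described by commits 3 and 4 can be sketched with stand-in types. This is a self-contained illustration, not DataFusion code: the `StatisticsRequest` variants, the shape of the `TableProvider` trait, and the names `request_min_max`, `plan_scan`, and `RecordingProvider` are assumptions made for the example. Only the `statistics_requests` field and the `with_statistics_requests` / `scan_with_args` names are taken from the description above.

```rust
// Stand-in types only: none of these are the real DataFusion definitions.

/// Hypothetical request vocabulary; the PR's real `StatisticsRequest` may differ.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum StatisticsRequest {
    Min { column: String },
    Max { column: String },
    RowCount,
}

/// Stand-in for the logical `TableScan` node.
#[derive(Debug, Clone, Default)]
pub struct TableScan {
    pub table_name: String,
    /// Advisory requests; empty unless some rule populates them.
    pub statistics_requests: Vec<StatisticsRequest>,
}

impl TableScan {
    pub fn with_statistics_requests(mut self, requests: Vec<StatisticsRequest>) -> Self {
        self.statistics_requests = requests;
        self
    }
}

/// Stand-in for `ScanArgs`, borrowing the requests from the logical node.
#[derive(Debug)]
pub struct ScanArgs<'a> {
    statistics_requests: &'a [StatisticsRequest],
}

impl<'a> ScanArgs<'a> {
    pub fn new() -> Self {
        ScanArgs { statistics_requests: &[] }
    }

    pub fn with_statistics_requests(mut self, requests: &'a [StatisticsRequest]) -> Self {
        self.statistics_requests = requests;
        self
    }

    pub fn statistics_requests(&self) -> &'a [StatisticsRequest] {
        self.statistics_requests
    }
}

/// Stand-in provider trait; here a scan just reports the requests it saw.
pub trait TableProvider {
    fn scan_with_args(&self, args: ScanArgs<'_>) -> Vec<StatisticsRequest>;
}

/// Hypothetical third-party provider that records incoming requests.
pub struct RecordingProvider;

impl TableProvider for RecordingProvider {
    fn scan_with_args(&self, args: ScanArgs<'_>) -> Vec<StatisticsRequest> {
        args.statistics_requests().to_vec()
    }
}

/// Hypothetical third-party "optimizer rule": annotate a scan with min/max requests.
pub fn request_min_max(scan: TableScan, column: &str) -> TableScan {
    scan.with_statistics_requests(vec![
        StatisticsRequest::Min { column: column.to_string() },
        StatisticsRequest::Max { column: column.to_string() },
    ])
}

/// Stand-in for the physical planner step that threads the field into `ScanArgs`.
pub fn plan_scan(scan: &TableScan, provider: &dyn TableProvider) -> Vec<StatisticsRequest> {
    let args = ScanArgs::new().with_statistics_requests(&scan.statistics_requests);
    provider.scan_with_args(args)
}
```

In this sketch a scan that no rule has annotated reaches the provider with an empty request list, which mirrors the "nothing observable changes" behaviour claimed for stock DataFusion.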
   
   Deliberately **left out** relative to #21996: the built-in `RequestStatistics` 
optimizer rule, the `FilePruner` / `ListingTable` consumer integration, the 
`PartitionedFile::satisfied_stats` per-file response field, and 
`StatisticsValue::Distribution` (which would depend on the now-deprecated 
`Distribution` type).
   
   ## Are these changes tested?
   
   Yes:
   - `datafusion-expr-common`: a unit test checking that `StatisticsRequest` is 
hashable and therefore usable as a `HashMap` key.
   - `datafusion/core/tests/user_defined/statistics_requests.rs`: an end-to-end 
integration test where a custom `OptimizerRule` annotates `TableScan` and a 
custom `TableProvider` asserts the requests reach `scan_with_args` — plus a 
test that without such a rule the provider sees an empty request list.
   - All existing `datafusion-expr` / `datafusion-optimizer` / 
`datafusion-proto` tests pass against the `TableScanBuilder` refactor.
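
Why hashability matters can be shown with a small sketch: a consumer can key its answers by the request itself. The enum variants, the `satisfy` function, and the use of plain `i64` values are assumptions for illustration only; the real `StatisticsRequest`, `StatisticsValue`, and `SatisfiedStatistics` types may look different.

```rust
use std::collections::HashMap;

/// Hypothetical request type; deriving `Eq` and `Hash` makes it a valid map key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum StatisticsRequest {
    Min { column: String },
    Max { column: String },
    RowCount,
}

/// Hypothetical consumer: answer the requests it can for a single column of
/// values and silently skip the rest, since requests are advisory.
pub fn satisfy(
    requests: &[StatisticsRequest],
    column: &str,
    values: &[i64],
) -> HashMap<StatisticsRequest, i64> {
    let mut satisfied = HashMap::new();
    for request in requests {
        let value = match request {
            StatisticsRequest::Min { column: c } if c.as_str() == column => {
                values.iter().copied().min()
            }
            StatisticsRequest::Max { column: c } if c.as_str() == column => {
                values.iter().copied().max()
            }
            StatisticsRequest::RowCount => Some(values.len() as i64),
            // Unknown column: leave the request unanswered.
            _ => None,
        };
        if let Some(v) = value {
            satisfied.insert(request.clone(), v);
        }
    }
    satisfied
}
```

A caller can then look up each answer by the request it originally issued, and treat a missing key as "not satisfied".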
   
   ## Are there any user-facing changes?
   
   Yes — this needs the `api change` label:
   
   - New public types `StatisticsRequest`, `StatisticsValue`, 
`SatisfiedStatistics` (re-exported via `datafusion_expr::statistics`).
   - New `TableScanBuilder`; `TableScan::try_new` is **deprecated** (still 
works, delegates to the builder).
   - `TableScan` gains a new public field `statistics_requests` — this breaks 
exhaustive `TableScan { .. }` struct literals downstream (the recommended fix 
is `TableScanBuilder`).
   - `ScanArgs` gains `with_statistics_requests` / `statistics_requests`.
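
The struct-literal breakage and the builder-based fix can be illustrated with simplified stand-in types. Everything below is hypothetical: the field set, the `String` placeholder for requests, the `new` / `with_projection` signatures, and the `Result<_, String>` error type are assumptions, not the actual `TableScanBuilder` API. The sketch only shows why a builder that derives dependent fields in `build()` keeps compiling when a field is added, whereas an exhaustive literal does not.

```rust
// Stand-in types only: simplified and hypothetical, not DataFusion's definitions.

#[derive(Debug, Clone, PartialEq)]
pub struct TableScan {
    pub table_name: String,
    pub source_columns: Vec<String>,
    pub projection: Option<Vec<usize>>,
    /// Derived from `source_columns` and `projection` inside `build()`.
    pub projected_columns: Vec<String>,
    /// Newly added field; an exhaustive struct literal written before this
    /// field existed would stop compiling. `String` is a placeholder type.
    pub statistics_requests: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct TableScanBuilder {
    table_name: String,
    source_columns: Vec<String>,
    projection: Option<Vec<usize>>,
    statistics_requests: Vec<String>,
}

impl TableScanBuilder {
    pub fn new(table_name: impl Into<String>, source_columns: Vec<String>) -> Self {
        Self {
            table_name: table_name.into(),
            source_columns,
            projection: None,
            statistics_requests: Vec::new(), // new fields get a default here
        }
    }

    pub fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self {
        self.projection = projection;
        self
    }

    pub fn with_statistics_requests(mut self, requests: Vec<String>) -> Self {
        self.statistics_requests = requests;
        self
    }

    /// Derive the projected columns, failing on an out-of-range index.
    pub fn build(self) -> Result<TableScan, String> {
        let projected_columns = match &self.projection {
            Some(indices) => indices
                .iter()
                .map(|&i| {
                    self.source_columns
                        .get(i)
                        .cloned()
                        .ok_or_else(|| format!("projection index {i} out of bounds"))
                })
                .collect::<Result<Vec<_>, _>>()?,
            None => self.source_columns.clone(),
        };
        Ok(TableScan {
            table_name: self.table_name,
            source_columns: self.source_columns,
            projection: self.projection,
            projected_columns,
            statistics_requests: self.statistics_requests,
        })
    }
}

/// Re-open an existing scan for modification, as `From<TableScan>` suggests.
impl From<TableScan> for TableScanBuilder {
    fn from(scan: TableScan) -> Self {
        Self {
            table_name: scan.table_name,
            source_columns: scan.source_columns,
            projection: scan.projection,
            statistics_requests: scan.statistics_requests,
        }
    }
}
```

Downstream code written against such a builder would not need to change when another field is added later, because only the builder's constructor has to supply the new default.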
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

