adriangb opened a new pull request, #22024:
URL: https://github.com/apache/datafusion/pull/22024
## Which issue does this PR close?
- Relates to #21624 (`datafusion.execution.collect_statistics` on wide
tables).
This PR extracts the first layer of #22000 — the opt-in `ParquetSource`
sampling primitives — as a self-contained change. The TABLESAMPLE SQL surface,
`Sample` logical/physical nodes, and `SamplePushdown` rule are deliberately not
included; they will land in follow-ups.
## Rationale for this change
DataFusion has the machinery for fine-grained parquet sampling
(`ParquetAccessPlan` with `Skip` / `Scan` / `Selection(RowSelection)`) but no
public way to ask for a sample without constructing the access plan by hand and
stuffing it into `PartitionedFile.extensions`. That works for one-off code but
is awkward for ad-hoc data exploration, layered helpers that want to compute
approximate stats over a bounded slice, and `EXPLAIN ANALYZE`-driven debug runs
against a representative slice.
This PR adds the lowest layer: opt-in builders on `ParquetSource` that
translate fractions into a `ParquetAccessPlan` lazily inside the opener (after
the footer is loaded, so we sample by real row-group index). It is additive and
has no behavior change for existing scans. The SQL surface in #22000 is built
on top of these primitives.
## What changes are included in this PR?
```rust
ParquetSource::new(schema)
.with_row_group_sampling(0.1) // keep ~10% of row groups per file
.with_row_fraction(0.05) // within each kept row group, keep ~5%
of rows
.with_row_cluster_size(8192); // controls window granularity (default
32 768)
```
`with_row_group_sampling(fraction)`:
* Selection is deferred until the opener has loaded the parquet footer, so
we sample by real row-group index.
* Deterministic per `(file_index, row_group_count, fraction)` — re-runs
match. The opener passes the execution `partition_index` as the stable
`file_index`, so sampling is reproducible across environments without depending
on object-store paths.
* Always keeps at least one row group (target = `max(1, ceil(N *
fraction))`).
* No-op when `fraction >= 1.0`.
`with_row_fraction(fraction)`:
* Translates the per-row-group target into K contiguous windows spread
evenly across the row group, each placed at a random offset within its stride.
Window count = `ceil(target / cluster_size)`.
* Materializes a `RowSelection` per kept row group; the parquet reader uses
the page index to read only the data pages covering the selected rows. This
gives \"page-level\" IO savings without requiring per-column page alignment
(which doesn't exist in parquet).
* Falls back gracefully when the page index is missing — the reader still
returns the right rows, the IO win just disappears.
* Deterministic per `(file_index, row_group_index, fraction, cluster_size)`.
* Window-layout math is extracted into a dedicated
`build_row_window_selectors` function and fuzz-tested across thousands of
configurations to guarantee no overlap, in-bounds positions, and full coverage.
The two layers compose: `row_group_fraction = 0.1` × `row_fraction = 0.1`
reads ~1% of the rows from ~10% of the row groups, with windows spread out so
the sample isn't clustered at one end of each row group.
### Internals
* New `ParquetSampling` struct re-exported at the crate root.
* Plumbed through `ParquetMorselizer` → `PreparedParquetOpen`.
* Two free functions invoked from `prune_row_groups` right after
`create_initial_plan`.
* New dep: `rand` with the `small_rng` feature (already in workspace
`Cargo.toml`).
### Differences vs. the original commit in #22000
Two pieces of review feedback on the parent PR are folded in here:
* `apply_*_sampling` keys on a stable `file_index: usize` rather than
`file_name: &str`, addressing
https://github.com/apache/datafusion/pull/22000#discussion_r3187286356. The
opener passes the execution `partition_index`. This removes the on-disk-path
dependency from the seed inputs while still decorrelating files in different
partitions.
* The row-window layout math is extracted into `build_row_window_selectors`
and fuzz-tested
(https://github.com/apache/datafusion/pull/22000#discussion_r3187392811).
Fuzzing surfaced an overlap bug at `target_rows ≈ total_rows` where
`window_size = ceil(target / n_windows)` could exceed `stride = total_rows /
n_windows`; the fix caps `window_size` at `stride`.
## Are these changes tested?
12 tests in `datafusion-datasource-parquet`:
**Row-group sampling (`sampling::tests`):**
* `row_group_sampling_keeps_target_count` — `ceil(N * fraction)` math.
* `row_group_sampling_is_deterministic` — same inputs → same selection.
* `row_group_sampling_differs_per_file_index` — different `file_index` →
different sample.
* `row_group_sampling_no_op_when_fraction_is_one` — fraction ≥ 1.0 keeps
everything.
* `row_group_sampling_target_at_least_one` — `fraction = 0.001` over 100 row
groups still keeps 1.
* `row_group_sampling_no_op_when_unset` — `None` is a no-op.
**Row-window selection (`sampling::tests`):**
* `row_window_selection_basic_layout` — hand-checked anchor case.
* `row_window_selection_returns_none_on_invalid_input` — degenerate inputs
(zero row group, zero target, zero cluster) return `None`.
* `row_window_selection_full_target_no_overlap` — the previously-buggy
`target_rows == total_rows` case.
* `row_window_selection_fuzz_invariants` — 5 000 randomized `(total_rows,
target_rows, cluster_size, seed)` configurations, asserting full coverage,
in-bounds positions, and no overlap.
* `row_window_selection_fuzz_determinism` — 1 000 iterations verifying
identical seeds produce identical layouts.
**End-to-end (`opener::test`):**
* `row_group_sampling_end_to_end` — writes a 4-row-group parquet to
`InMemory`, scans with `fraction = 0.5`, asserts exactly 6 rows out (2 row
groups × 3 rows).
* `row_fraction_end_to_end` — writes a 100-row single-row-group parquet,
scans with `row_fraction = 0.1` and `cluster_size = 4`, asserts the result is
in the expected range.
`cargo build`, `cargo fmt --all`, and `cargo clippy -p
datafusion-datasource-parquet --all-targets --all-features -- -D warnings` are
clean.
## Are there any user-facing changes?
* **New public Rust API on `ParquetSource`:** `with_row_group_sampling`,
`with_row_fraction`, `with_row_cluster_size`, `sampling()`, plus the
`ParquetSampling` struct.
* Existing scans / queries that don't opt in are unaffected.
* No SQL surface yet — that lands in a follow-up to #22000.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]