alamb opened a new pull request, #10058:
URL: https://github.com/apache/arrow-rs/pull/10058
## What is this?
A **draft** PR that bundles apache/arrow-rs#10020 ("Pluggable page spilling
API for the Parquet ArrowWriter") **plus a runnable example** that exercises
its new public `PageStore` API from the outside, as an end user would. It is
intended to help **review** #10020 — see that PR for the API itself.
This branch contains #10020's commits and adds one file on top,
`parquet/examples/spill_page_store.rs`.
## The example
The example writes a wide, skewed Parquet file into a single row group, containing:
- a few `Int64` columns (`--int-columns`, default 3)
- some small ~20-byte string columns (`--small-string-columns`, default 5)
- a configurable pile of fat ~8 KiB string columns
(`--large-string-columns`, default 10)
and reports peak `ArrowWriter::memory_size()` with and without a spilling
page store.
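As a rough illustration of how a peak figure like this can be collected, the sketch below samples a memory reading after every batch and keeps the maximum. `PeakTracker` and `peak_over_batches` are hypothetical helper names and are not taken from the example file; the writer is kept generic so the snippet compiles with the standard library alone.

```rust
/// Tracks the largest value seen across repeated samples.
/// Hypothetical helper, not part of the `parquet` crate.
#[derive(Debug, Default)]
pub struct PeakTracker {
    peak: usize,
}

impl PeakTracker {
    /// Record one reading, keeping it only if it exceeds the current peak.
    pub fn sample(&mut self, current: usize) {
        self.peak = self.peak.max(current);
    }

    /// Largest reading recorded so far (0 if nothing was sampled).
    pub fn peak(&self) -> usize {
        self.peak
    }
}

/// Calls `write_batch` `batches` times and samples `memory_size` after each
/// call, returning the largest reading observed.
pub fn peak_over_batches<W>(
    writer: &mut W,
    batches: usize,
    mut write_batch: impl FnMut(&mut W),
    memory_size: impl Fn(&W) -> usize,
) -> usize {
    let mut tracker = PeakTracker::default();
    for _ in 0..batches {
        write_batch(writer);
        tracker.sample(memory_size(writer));
    }
    tracker.peak()
}
```

With a real `ArrowWriter`, the two closures would presumably wrap `writer.write(&batch)` and `writer.memory_size()`; that wiring is an assumption about how the example measures, not a copy of it.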
The spilling backend is a ~30-line `TempFilePageStore` (one unlinked temp
file per column chunk): `put` appends a page blob and returns an opaque
`PageKey`, `take` seeks and reads it back, and `memory_size()` keeps its
default of `0` because the bytes now live in the file, not on the heap.
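The shape described above can be sketched with the standard library alone. Everything named here is a hypothetical stand-in: `PageStoreSketch` is not the `PageStore` trait from #10020 (whose exact signatures are defined in that PR), this `PageKey` simply records an offset and a length, and the temp file is created by hand rather than with the `tempfile` crate.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::PathBuf;

/// Opaque handle for a stored page: where it starts and how long it is.
/// Hypothetical; the real `PageKey` may carry different data.
#[derive(Debug, Clone, Copy)]
pub struct PageKey {
    offset: u64,
    len: usize,
}

/// Hypothetical stand-in for the `PageStore` trait, for illustration only.
pub trait PageStoreSketch {
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey>;
    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>>;
    /// Heap bytes held by the store; the default reports none.
    fn memory_size(&self) -> usize {
        0
    }
}

/// Spills pages to one temp file that is unlinked right after creation.
pub struct TempFilePageStoreSketch {
    file: File,
    end: u64, // current append position
}

impl TempFilePageStoreSketch {
    pub fn new(name: &str) -> io::Result<Self> {
        let path: PathBuf = std::env::temp_dir().join(name);
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(&path)?;
        // Best-effort unlink; the open handle keeps the data reachable.
        let _ = fs::remove_file(&path);
        Ok(Self { file, end: 0 })
    }
}

impl PageStoreSketch for TempFilePageStoreSketch {
    /// Append the page bytes at the end of the file and return their location.
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey> {
        let offset = self.end;
        self.file.seek(SeekFrom::Start(offset))?;
        self.file.write_all(page)?;
        self.end += page.len() as u64;
        Ok(PageKey { offset, len: page.len() })
    }

    /// Seek to the recorded offset and read the page back into memory.
    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; key.len];
        self.file.seek(SeekFrom::Start(key.offset))?;
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
}
```

Because `memory_size()` is left at its default, the store contributes nothing to the writer's reported memory, which is the effect the description above relies on.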
### Flags
- `--large-string-columns N` — number of fat ~8 KiB string columns (default
10)
- `--spill` — use the spilling `TempFilePageStore` instead of the default
in-memory buffering
- also: `--small-string-columns`, `--int-columns`, `--rows`, `--batch-size`,
`--output <path>`
### Running
```sh
# Baseline: default in-memory page buffering
cargo run --release --features cli --example spill_page_store
# Spill completed pages to temp files
cargo run --release --features cli --example spill_page_store -- --spill
```
### What it shows
On the defaults (10 fat columns, ~160 MiB row group):
| page buffering | peak `ArrowWriter::memory_size()` |
|---|---|
| in-memory (default) | ~161 MiB |
| `TempFilePageStore` (`--spill`) | ~21 MiB |
That is, with spilling, peak writer memory is bounded by the size of the
in-flight encoder buffers rather than by the row group size. Increase the skew
with `--large-string-columns 30` to widen the gap. Writing the same data to a
real file with `--output` produces a **byte-identical** Parquet file in both
modes, confirming that the spilling path is a transparent drop-in.
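One way to check the byte-identical claim locally is to compare the two output files directly. The helper below is a minimal sketch using only the standard library; `files_identical` is a hypothetical name, and the paths would be whatever was passed to `--output` in each run.

```rust
/// Returns true when both files have exactly the same bytes.
/// Hypothetical helper for checking the two `--output` files.
pub fn files_identical(
    a: &std::path::Path,
    b: &std::path::Path,
) -> std::io::Result<bool> {
    // Cheap length check first, so mismatched sizes skip the full read.
    if std::fs::metadata(a)?.len() != std::fs::metadata(b)?.len() {
        return Ok(false);
    }
    // Reads both files fully into memory; fine for a one-off check,
    // though a streaming comparison would suit very large outputs better.
    Ok(std::fs::read(a)? == std::fs::read(b)?)
}
```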
## Notes for review
- `tempfile` and `sysinfo` are already `dev-dependencies` of the `parquet`
crate, so the example needs no new deps; it is gated on `required-features =
["arrow", "cli"]`.
- The `TempFilePageStore` here is intentionally the same shape as the one in
`parquet/tests/arrow_writer.rs`, so the example and the test corroborate each
other.
🤖 Generated with [Claude Code](https://claude.com/claude-code)