alamb opened a new pull request, #10058:
URL: https://github.com/apache/arrow-rs/pull/10058

   ## What is this?
   
   A **draft** PR that bundles apache/arrow-rs#10020 ("Pluggable page spilling 
API for the Parquet ArrowWriter") **plus a runnable example** that exercises 
its new public `PageStore` API from the outside, as an end user would. It is 
intended to help **review** #10020 — see that PR for the API itself.
   
   This branch contains #10020's commits and adds one file on top, 
`parquet/examples/spill_page_store.rs`.
   
   ## The example
   
   The example writes a wide Parquet file with heavily skewed column sizes into 
a single row group:
   
   - a few `Int64` columns (`--int-columns`, default 3)
   - some small ~20-byte string columns (`--small-string-columns`, default 5)
   - a configurable pile of fat ~8 KiB string columns 
(`--large-string-columns`, default 10)
   
   and reports peak `ArrowWriter::memory_size()` with and without a spilling 
page store.
   
   The spilling backend is a ~30-line `TempFilePageStore` (one unlinked temp 
file per column chunk): `put` appends a page blob and returns an opaque 
`PageKey`, `take` seeks and reads it back, and `memory_size()` keeps its 
default of `0` because the bytes now live in the file, not on the heap.
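
To make the shape of such a store concrete, here is a minimal, standard-library-only sketch. It is not the code from this PR: the `PageStore` trait and `PageKey` type below are hypothetical stand-ins written from the description above, and the real signatures in #10020 may differ. It also uses a named file in the system temp directory that is deleted on drop, rather than an unlinked file from the `tempfile` crate.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::PathBuf;

/// Hypothetical key type: records where a page blob lives in the spill file.
#[derive(Debug)]
pub struct PageKey {
    offset: u64,
    len: usize,
}

/// Hypothetical stand-in for the trait in #10020 (assumed shape, not the real API).
pub trait PageStore {
    /// Store one encoded page and return a key to retrieve it later.
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey>;
    /// Read a previously stored page back; the key is consumed.
    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>>;
    /// Heap bytes held by the store; 0 by default.
    fn memory_size(&self) -> usize {
        0
    }
}

/// Append-only spill file; one instance would back one column chunk.
pub struct TempFilePageStore {
    file: File,
    path: PathBuf,
    end: u64, // offset at which the next page is appended
}

impl TempFilePageStore {
    pub fn new(name: &str) -> io::Result<Self> {
        let file_name = format!("{}-{}.pages", name, std::process::id());
        let path = std::env::temp_dir().join(file_name);
        let file = OpenOptions::new()
            .read(true)
            .write(true)
            .create(true)
            .truncate(true)
            .open(&path)?;
        Ok(Self { file, path, end: 0 })
    }
}

impl PageStore for TempFilePageStore {
    fn put(&mut self, page: &[u8]) -> io::Result<PageKey> {
        // Always append at the tracked end, even if `take` moved the cursor.
        self.file.seek(SeekFrom::Start(self.end))?;
        self.file.write_all(page)?;
        let key = PageKey { offset: self.end, len: page.len() };
        self.end += page.len() as u64;
        Ok(key)
    }

    fn take(&mut self, key: PageKey) -> io::Result<Vec<u8>> {
        let mut buf = vec![0u8; key.len];
        self.file.seek(SeekFrom::Start(key.offset))?;
        self.file.read_exact(&mut buf)?;
        Ok(buf)
    }
    // `memory_size` keeps the default of 0: page bytes live in the file.
}

impl Drop for TempFilePageStore {
    fn drop(&mut self) {
        // Best-effort cleanup of the spill file.
        let _ = fs::remove_file(&self.path);
    }
}
```

In this sketch `take` does not reclaim file space; the whole file goes away when the store is dropped, which is enough for a store that lives only as long as one column chunk.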
   
   ### Flags
   
   - `--large-string-columns N` — number of fat ~8 KiB string columns (default 
10)
   - `--spill` — use the spilling `TempFilePageStore` instead of the default 
in-memory buffering
   - also: `--small-string-columns`, `--int-columns`, `--rows`, `--batch-size`, 
`--output <path>`
   
   ### Running
   
   ```sh
   # Baseline: default in-memory page buffering
   cargo run --release --features cli --example spill_page_store
   
   # Spill completed pages to temp files
   cargo run --release --features cli --example spill_page_store -- --spill
   ```
   
   ### What it shows
   
   On the defaults (10 fat columns, ~160 MiB row group):
   
   | page buffering | peak `ArrowWriter::memory_size()` |
   |---|---|
   | in-memory (default) | ~161 MiB |
   | `TempFilePageStore` (`--spill`) | ~21 MiB |
   
   That is, with spilling, peak writer memory is bounded by the in-flight 
encoder buffers rather than by the row group size (roughly 7.7x lower on the 
defaults). Pass `--large-string-columns 30` to widen the skew and the gap. 
Writing the same data to a real file with `--output` produces a 
**byte-identical** Parquet file in both modes, confirming that the spilling 
path is a transparent drop-in.
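
One way to check the byte-identical claim yourself is to write one file per mode with `--output` and compare them. The helper below is a small illustrative sketch, not part of the example; the name `files_identical` is made up here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns true when both files contain exactly the same bytes.
/// Reads each file fully into memory, which is acceptable at the
/// ~160 MiB scale of the default run.
pub fn files_identical(a: &Path, b: &Path) -> io::Result<bool> {
    // Cheap early exit: files of different length cannot be identical.
    if fs::metadata(a)?.len() != fs::metadata(b)?.len() {
        return Ok(false);
    }
    Ok(fs::read(a)? == fs::read(b)?)
}
```

Calling it on the baseline output and the `--spill` output should return `Ok(true)` if the two modes really produce the same file.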
   
   ## Notes for review
   
   - `tempfile` and `sysinfo` are already `dev-dependencies` of the `parquet` 
crate, so the example needs no new deps; it is gated on `required-features = 
["arrow", "cli"]`.
   - The `TempFilePageStore` here is intentionally the same shape as the one in 
`parquet/tests/arrow_writer.rs`, so the example and the test corroborate each 
other.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)

