adriangb opened a new issue, #10062: URL: https://github.com/apache/arrow-rs/issues/10062
## Background [#10020](https://github.com/apache/arrow-rs/pull/10020) adds a pluggable `PageStore` so the `ArrowWriter` can spill completed column-chunk pages out of the heap while a row group is buffered. During review ([this thread](https://github.com/apache/arrow-rs/pull/10020#discussion_r2944) and the surrounding ones) @alamb noted that the writer now has *two* levels of buffering and that the dictionary page is special-cased in a way that's hard to follow. This issue tracks the deferred cleanup so #10020 can land without it. ## The root cause A dictionary page is finalized **last** (it isn't known until `close()` — dictionary fallback can occur mid-column) but must be written **first** in the file. So production order is `data₁…dataₙ, dict` and file order is `dict, data₁…dataₙ`. Resolving that inversion is what creates the apparent double-buffering: - `GenericColumnWriter::data_pages: VecDeque<CompressedPage>` (`column/writer/mod.rs`) — the pre-existing buffer that holds dictionary-column data pages until `close()` on the column-at-a-time `SerializedFileWriter` path (bytes commit live there, so there's no other way to get dict-first). - On the `ArrowWriter` path #10020 instead streams data pages straight to the `PageStore`, holds the dict page in memory (`ArrowColumnChunkData.dictionary: Vec<Bytes>`), and rewrites offsets to dict-first at splice (`ColumnCloseResult::update_dictionary_location`). The seam between these is the back-channel `PageWriter::defers_dictionary_ordering()` flag, which makes the column writer change its buffering based on a property of its sink. (As cheap insurance, that method and `buffered_memory_size()` were marked `#[doc(hidden)]` / unstable in #10020 so this cleanup won't be a breaking change — see f4e3518581.) ## Proposed cleanup (two separable steps) **B — route the dictionary page through the store too.** Replace the `dictionary: Vec<Bytes>` carve-out with a tracked `dictionary_key: Option<PageKey>`; at splice, take that key first, then the data keys. This makes the store uniform (it no longer treats the dict page specially) and lets dictionary pages spill as well — relevant for misconfigured writes with a large dict page. Entirely private to `ArrowColumnChunkData`. Trade-off to weigh: today the dict page is kept resident deliberately to avoid round-tripping ~1 bounded page through the backend. **C — move page-ordering ownership into the page-writer layer.** Have `GenericColumnWriter` always stream pages in production order and let the page writer own final layout: `SerializedPageWriter` buffers data pages until it sees the dict, `ArrowPageWriter` buffers-and-reorders. This removes the `defers_dictionary_ordering()` branch and the `data_pages` field from the shared column writer entirely. Note it relocates rather than eliminates the column-at-a-time buffering (you still can't seek back in a plain `Write` sink), but it gives one clean seam instead of a flag-driven split. Touches the column writer used by **both** write paths, so it wants its own benchmarking. B and C compose; B+C is the full "defer all buffering to the store; the store needn't know page kinds" that was suggested in review. ## Constraints / non-goals - **Not a breaking change to the externally-implemented surface.** `PageStore` (`put`/`take`/`PageKey`), `PageStoreFactory`, and `PageStoreArgs` stay as-is; B reuses the same `put`/`take`. The only public-API churn is to the already-`#[doc(hidden)]` `PageWriter` methods. - Keep the **key-based** `PageStore` API (not a page iterator): the dict-first reordering needs random access, which a FIFO iterator can't express. - No change to on-disk output beyond what #10020 already does (the `page_encoding_stats` ordering difference for dictionary columns, which is spec-valid). ## Acceptance criteria - [ ] `defers_dictionary_ordering()` removed (or its role absorbed) and `GenericColumnWriter` has a single buffering path. - [ ] Dictionary pages are no longer special-cased outside the store (B), or this is explicitly decided against with rationale. - [ ] Existing round-trip / spill memory tests still pass; column-at-a-time write memory and throughput benchmarked for no regression. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
