adriangb opened a new issue, #10062:
URL: https://github.com/apache/arrow-rs/issues/10062

   ## Background
   
   [#10020](https://github.com/apache/arrow-rs/pull/10020) adds a pluggable 
`PageStore` so the `ArrowWriter` can spill completed column-chunk pages out of 
the heap while a row group is buffered. During review ([this 
thread](https://github.com/apache/arrow-rs/pull/10020#discussion_r2944) and the 
surrounding ones) @alamb noted that the writer now has *two* levels of 
buffering and that the dictionary page is special-cased in a way that's hard to 
follow. This issue tracks the deferred cleanup so #10020 can land without it.
   
   ## The root cause
   
   A dictionary page is finalized **last** (it isn't known until `close()` — 
dictionary fallback can occur mid-column) but must be written **first** in the 
file. So production order is `data₁…dataₙ, dict` and file order is `dict, 
data₁…dataₙ`. Resolving that inversion is what creates the apparent 
double-buffering:
   
   - `GenericColumnWriter::data_pages: VecDeque<CompressedPage>` 
(`column/writer/mod.rs`) — the pre-existing buffer that holds dictionary-column 
data pages until `close()` on the column-at-a-time `SerializedFileWriter` path 
(bytes commit live there, so there's no other way to get dict-first).
   - On the `ArrowWriter` path #10020 instead streams data pages straight to 
the `PageStore`, holds the dict page in memory 
(`ArrowColumnChunkData.dictionary: Vec<Bytes>`), and rewrites offsets to 
dict-first at splice (`ColumnCloseResult::update_dictionary_location`).
   
   The seam between these is the back-channel 
`PageWriter::defers_dictionary_ordering()` flag, which makes the column writer 
change its buffering based on a property of its sink. (As cheap insurance, that 
method and `buffered_memory_size()` were marked `#[doc(hidden)]` / unstable in 
#10020 so this cleanup won't be a breaking change — see f4e3518581.)
   
   ## Proposed cleanup (two separable steps)
   
   **B — route the dictionary page through the store too.** Replace the 
`dictionary: Vec<Bytes>` carve-out with a tracked `dictionary_key: 
Option<PageKey>`; at splice, take that key first, then the data keys. This 
makes the store uniform (it no longer treats the dict page specially) and lets 
dictionary pages spill as well — relevant for misconfigured writes with a large 
dict page. Entirely private to `ArrowColumnChunkData`. Trade-off to weigh: 
today the dict page is kept resident deliberately to avoid round-tripping ~1 
bounded page through the backend.
   
   **C — move page-ordering ownership into the page-writer layer.** Have 
`GenericColumnWriter` always stream pages in production order and let the page 
writer own final layout: `SerializedPageWriter` buffers data pages until it 
sees the dict, `ArrowPageWriter` buffers-and-reorders. This removes the 
`defers_dictionary_ordering()` branch and the `data_pages` field from the 
shared column writer entirely. Note it relocates rather than eliminates the 
column-at-a-time buffering (you still can't seek back in a plain `Write` sink), 
but it gives one clean seam instead of a flag-driven split. Touches the column 
writer used by **both** write paths, so it wants its own benchmarking.
   
   B and C compose; B+C is the full "defer all buffering to the store; the 
store needn't know page kinds" that was suggested in review.
   
   ## Constraints / non-goals
   
   - **Not a breaking change to the externally-implemented surface.** 
`PageStore` (`put`/`take`/`PageKey`), `PageStoreFactory`, and `PageStoreArgs` 
stay as-is; B reuses the same `put`/`take`. The only public-API churn is to the 
already-`#[doc(hidden)]` `PageWriter` methods.
   - Keep the **key-based** `PageStore` API (not a page iterator): the 
dict-first reordering needs random access, which a FIFO iterator can't express.
   - No change to on-disk output beyond what #10020 already does (the 
`page_encoding_stats` ordering difference for dictionary columns, which is 
spec-valid).
   
   ## Acceptance criteria
   
   - [ ] `defers_dictionary_ordering()` removed (or its role absorbed) and 
`GenericColumnWriter` has a single buffering path.
   - [ ] Dictionary pages are no longer special-cased outside the store (B), or 
this is explicitly decided against with rationale.
   - [ ] Existing round-trip / spill memory tests still pass; column-at-a-time 
write memory and throughput benchmarked for no regression.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to