asuresh8 opened a new pull request, #9521:
URL: https://github.com/apache/arrow-rs/pull/9521

   ## Which issue does this PR close?
   
   N/A - Performance optimization
   
   ## Rationale
   
   `GenericColumnWriter::add_data_page()` allocates fresh `Vec<u8>` buffers on 
every page flush: one for page assembly (`buffer`) and one for compression 
output (`compressed_buf`). For workloads that produce many pages (e.g., TPC-H 
SF1 lineitem: 17 columns x 49 row groups x ~70 pages/column = ~58K pages), this 
creates significant allocator overhead and cache pollution from touching fresh 
memory on every page.
   
   Additionally, the V1 compression path calls `compressed_buf.shrink_to_fit()` 
immediately before consuming the buffer, which triggers an unnecessary 
realloc+copy that is immediately discarded when the buffer is converted to 
`Bytes`.
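   The wasted shrink can be sketched with plain `Vec` (this is an illustrative stand-in, not the arrow-rs code; `finish_page` is a hypothetical name):

   ```rust
   /// Hypothetical stand-in for the V1 page-finish path: the buffer's contents
   /// leave the writer immediately, so shrinking first is pure overhead.
   fn finish_page(compressed_buf: &mut Vec<u8>) -> Vec<u8> {
       // Old behavior: realloc+copy down to len() right before handing off.
       compressed_buf.shrink_to_fit();
       // Stand-in for `Bytes::from(...)`: the allocation is consumed either way.
       std::mem::take(compressed_buf)
   }

   fn main() {
       let mut buf = Vec::with_capacity(1024);
       buf.extend_from_slice(b"compressed page");
       let page = finish_page(&mut buf);
       assert_eq!(page, b"compressed page");
       assert_eq!(buf.capacity(), 0); // the reservation left with the page
   }
   ```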
   
   ## What changes are included in this PR?
   
   Add two persistent `Vec<u8>` fields (`page_buf` and `compressed_buf`) to 
`GenericColumnWriter` that are cleared and reused across data pages instead of 
allocating fresh buffers per page.
   
   **Changes:**
   1. Add `page_buf: Vec<u8>` and `compressed_buf: Vec<u8>` fields to 
`GenericColumnWriter`
   2. In `add_data_page()`, replace `let mut buffer = vec![]` with 
`self.page_buf.clear()` and reuse
   3. Replace `let mut compressed_buf = Vec::with_capacity(...)` with 
`self.compressed_buf.clear()` and reuse
   4. Use `Bytes::copy_from_slice()` to create the final `Bytes` from the 
reused buffer (since `Bytes::from(Vec<u8>)` would consume the allocation)
   5. Remove `shrink_to_fit()` after compression (counterproductive with buffer 
reuse)
   6. Update `memory_size()` to include the capacity of the reusable buffers 
(partial improvement to an already-approximate metric)
   7. Add 3 multi-page roundtrip tests with compression (V1, V2, nullable) that 
validate the `.clear()` invariant across page boundaries
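   The reuse pattern in changes 2-4 looks roughly like this (a minimal sketch: `PageWriter` and its fields mirror the PR's naming but are not the actual `GenericColumnWriter` implementation, and the "compression" step is a plain copy):

   ```rust
   /// Illustrative sketch of clear-and-reuse across data pages.
   struct PageWriter {
       page_buf: Vec<u8>,       // persistent page-assembly buffer
       compressed_buf: Vec<u8>, // persistent compression-output buffer
       pages: Vec<Vec<u8>>,     // stand-in for the written output
   }

   impl PageWriter {
       fn add_data_page(&mut self, values: &[u8]) {
           // Instead of `let mut buffer = vec![]`: clear() resets len, keeps capacity.
           self.page_buf.clear();
           self.page_buf.extend_from_slice(values);

           // Instead of `Vec::with_capacity(...)` per page.
           self.compressed_buf.clear();
           // Stand-in for a real codec; just copies the assembled page.
           self.compressed_buf.extend_from_slice(&self.page_buf);

           // Stand-in for `Bytes::copy_from_slice`: copy out so the
           // allocation survives for the next page.
           self.pages.push(self.compressed_buf.clone());
       }
   }

   fn main() {
       let mut w = PageWriter {
           page_buf: Vec::new(),
           compressed_buf: Vec::new(),
           pages: Vec::new(),
       };
       w.add_data_page(b"page one payload");
       let cap = w.page_buf.capacity();
       w.add_data_page(b"page two"); // shorter page: no new allocation
       assert_eq!(w.pages[1], b"page two"); // no stale bytes from page one
       assert!(w.page_buf.capacity() >= cap);
   }
   ```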
   
   **Not changed:**
   - Dictionary page code (only ~17 allocations total per workload, not worth 
the diff)
   - `LevelEncoder` buffer allocation (small buffers, would require API changes)
   - Any public API
   
   **Trade-off:** `Bytes::copy_from_slice` adds one memcpy per page vs the 
previous zero-copy `Bytes::from(Vec<u8>)`. However, the eliminated allocations 
(2-3 per page), removed `shrink_to_fit` realloc, and cache warming from reused 
buffers provide a net benefit. The copy operates on the smaller compressed data 
when compression is enabled.
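   The move-vs-copy distinction behind this trade-off can be simulated with plain `Vec`s (hedged sketch: `take_zero_copy` stands in for `Bytes::from(Vec<u8>)`, `copy_out` for `Bytes::copy_from_slice`; both names are hypothetical):

   ```rust
   /// Zero-copy handoff: cheap now, but the writer loses the allocation.
   fn take_zero_copy(buf: &mut Vec<u8>) -> Vec<u8> {
       std::mem::take(buf)
   }

   /// One memcpy per page, but the buffer's capacity survives for reuse.
   fn copy_out(buf: &[u8]) -> Vec<u8> {
       buf.to_vec()
   }

   fn main() {
       let mut buf = Vec::with_capacity(1024);
       buf.extend_from_slice(b"compressed");

       let page_a = copy_out(&buf);
       assert_eq!(page_a, b"compressed");
       assert!(buf.capacity() >= 1024); // still reusable for the next page

       let page_b = take_zero_copy(&mut buf);
       assert_eq!(page_b, b"compressed");
       assert_eq!(buf.capacity(), 0); // allocation left with the page
   }
   ```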
   
   ## Are these changes tested?
   
   Yes. All 88 existing tests pass unchanged. 3 new tests added:
   - `test_multi_page_roundtrip_with_compression` - V1 path, 5 pages, Snappy
   - `test_multi_page_roundtrip_with_compression_v2` - V2 path, 5 pages, Snappy
   - `test_multi_page_roundtrip_nullable_with_compression` - V1 path with 
nullable columns, 5 pages, Snappy
   
   These tests write enough data to trigger multiple data pages (via 
`set_data_page_row_count_limit(1000)` with 5000 values), then read back and 
verify all values match. This validates that `.clear()` properly prevents stale 
data leakage between pages.
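   The invariant these tests exercise reduces to a small property of `Vec::clear()` (illustrative only; `second_page` is a hypothetical helper, not test code from the PR):

   ```rust
   /// Returns what a reused buffer holds after writing a second, shorter page.
   fn second_page(buf: &mut Vec<u8>, first: &[u8], second: &[u8]) -> Vec<u8> {
       buf.extend_from_slice(first);
       buf.clear(); // resets len to 0, keeps the allocation
       buf.extend_from_slice(second);
       buf.clone()
   }

   fn main() {
       let mut buf = Vec::new();
       let page = second_page(&mut buf, b"AAAAAAAAAA", b"BB");
       assert_eq!(page, b"BB");       // no stale 'A' bytes leak into page two
       assert!(buf.capacity() >= 10); // capacity from page one is retained
   }
   ```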
   
   ## Are there any user-facing changes?
   
   No. All changes are to private fields and internal methods.
   
   ## Benchmark Results
   
   `cargo bench -p parquet --bench arrow_writer` on Apple M-series:
   
   | Category | Improved (>2%) | Regressed (>2%) | Neutral |
   |----------|---------------|-----------------|---------|
   | Count    | 13            | 0               | 42      |
   
   Aggregate: -0.76% (excluding float_with_nans noise, confirmed by re-run)
   
   Notable improvements:
   - `primitive_non_null/zstd_parquet_2`: -4.8%
   - `primitive_non_null/bloom_filter`: -4.0%
   - `bool_non_null/zstd_parquet_2`: -3.4%
   - `string/zstd_parquet_2`: -3.4%
   - `string/parquet_2`: -3.0%
   - `string_dictionary/default`: -2.7%
   
   No confirmed regressions (initial `float_with_nans` regression was 
measurement noise, confirmed by subsequent run with p > 0.05).
   

