[I] Parquet: level encoding cost should be proportional to RLE output size, not input row count [arrow-rs]

via GitHub Wed, 01 Apr 2026 20:35:28 -0700


HippoBaro opened a new issue, #9652:
URL: https://github.com/apache/arrow-rs/issues/9652


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Very sparse columns (high null ratio) are just as expensive to write as 
dense, high-cardinality ones, even though the underlying level encoding (RLE) 
compresses a long run of identical values into a single entry.
   
   The cost of writing should reflect the cost of encoding: writing the same 
value a million times should be roughly as cheap as writing it once.
   
   **Describe the solution you'd like**
   The writer should perform per-run work instead of per-value work wherever 
possible. When long runs of identical definition/repetition levels are detected 
(as is typical for sparse columns), counting, histogram updates, and RLE 
encoding should all be amortized over the entire run in O(1) rather than O(n). 
Entirely-null columns should be an especially cheap special case.
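
   To make the per-run idea concrete, here is a minimal sketch. It is not the 
arrow-rs implementation: `Run`, `level_runs`, `LevelStats`, and `accumulate` 
are hypothetical names invented for illustration, and levels are assumed to be 
plain `i16` values. Run detection here still scans the input once; the point 
is that everything downstream (histogram, counters, and by extension the RLE 
writer) touches each run once rather than each value.

```rust
/// A run of identical level values: (level, run length).
/// Hypothetical type for this sketch, not an arrow-rs API.
type Run = (i16, usize);

/// Split a slice of levels into maximal runs of identical values.
/// This scan is O(n); a real writer might instead derive runs directly
/// from information it already has (for example, a null count).
fn level_runs(levels: &[i16]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < levels.len() {
        let value = levels[i];
        // Length of the run of `value` starting at index `i`.
        let len = levels[i..].iter().take_while(|&&l| l == value).count();
        runs.push((value, len));
        i += len;
    }
    runs
}

/// Bookkeeping that is updated once per run instead of once per value.
struct LevelStats {
    /// histogram[level] = number of values with that level.
    histogram: Vec<u64>,
    /// Number of values whose level equals `max_level` (i.e. non-null).
    non_null: u64,
}

/// Fold a list of runs into statistics with O(1) work per run.
/// Assumes every level is in the range 0..=max_level.
fn accumulate(runs: &[Run], max_level: i16) -> LevelStats {
    let mut stats = LevelStats {
        histogram: vec![0; max_level as usize + 1],
        non_null: 0,
    };
    for &(level, len) in runs {
        stats.histogram[level as usize] += len as u64;
        if level == max_level {
            stats.non_null += len as u64;
        }
    }
    stats
}
```

   Under these assumptions, an entirely-null column of a million rows with 
a maximum definition level of 1 collapses to the single run `(0, 1_000_000)`, 
so `accumulate` performs one histogram update, and an RLE encoder fed the same 
run list would emit one entry.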
   
   **Describe alternatives you've considered**
   N/A
   
   **Additional context**
   N/A
   


