HippoBaro opened a new issue, #9652: URL: https://github.com/apache/arrow-rs/issues/9652
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Very sparse columns (high null ratio) are currently as expensive to write as dense, high-cardinality ones, even though the underlying encoding (RLE) compresses long runs of identical values into a single entry. The cost of writing should reflect the cost of encoding: writing the same value a million times should be roughly as cheap as writing it once.

**Describe the solution you'd like**

The writer should perform per-run work instead of per-value work wherever possible. When long runs of identical definition/repetition levels are detected (as is typical for sparse columns), counting, histogram updates, and RLE encoding should all be amortized over the entire run in O(1) rather than O(n). Entirely-null columns should be an especially cheap special case.

**Describe alternatives you've considered**

N/A

**Additional context**

N/A
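The per-run idea above can be sketched as a single pass that collapses the level buffer into runs, after which null counting (and, by extension, histogram updates and RLE emission) becomes O(runs) instead of O(values). This is an illustrative sketch only; `Run` and `scan_runs` are hypothetical names, not arrow-rs APIs.

```rust
/// A maximal run of identical definition levels (hypothetical type,
/// not part of arrow-rs).
struct Run {
    level: i16,
    len: usize,
}

/// Collapse a level buffer into maximal runs in one pass. Each run can
/// then be counted, histogrammed, or RLE-encoded with O(1) work,
/// regardless of its length.
fn scan_runs(levels: &[i16]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &level in levels {
        match runs.last_mut() {
            Some(run) if run.level == level => run.len += 1,
            _ => runs.push(Run { level, len: 1 }),
        }
    }
    runs
}

fn main() {
    // A very sparse column: one non-null value among a million nulls,
    // where definition level 0 means null and 1 means present.
    let mut levels = vec![0i16; 1_000_000];
    levels[500_000] = 1;

    let runs = scan_runs(&levels);
    // Three runs: 500_000 nulls, 1 value, 499_999 nulls. All downstream
    // accounting is now proportional to 3, not 1_000_000.
    assert_eq!(runs.len(), 3);

    // Per-run null counting: each run contributes in O(1).
    let nulls: usize = runs
        .iter()
        .filter(|r| r.level == 0)
        .map(|r| r.len)
        .sum();
    assert_eq!(nulls, 999_999);

    println!("runs={} nulls={}", runs.len(), nulls);
}
```

An entirely-null column degenerates to a single run, which is why the issue calls it out as an especially cheap special case.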
