tustvold opened a new issue, #2889:
URL: https://github.com/apache/arrow-rs/issues/2889
**Describe the bug**
The size of RLE encoded data is routinely estimated as
```
RleEncoder::min_buffer_size(bit_width)
    + RleEncoder::max_buffer_size(bit_width, self.indices.len())
```
Where `RleEncoder::min_buffer_size` is defined as
```
let max_bit_packed_run_size = 1 + bit_util::ceil(
    (MAX_VALUES_PER_BIT_PACKED_RUN * bit_width as usize) as i64,
    8,
);
let max_rle_run_size =
    bit_util::MAX_VLQ_BYTE_LEN + bit_util::ceil(bit_width as i64, 8) as usize;
std::cmp::max(max_bit_packed_run_size as usize, max_rle_run_size)
```
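In practice this will almost always be roughly `64 * bit_width` bytes, because the bit-packed term (`1 + 64 * bit_width` for a 512-value run) dominates for any nonzero `bit_width`.
`RleEncoder::max_buffer_size` is in turn defined as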
```
let bytes_per_run = bit_width;
let num_runs = bit_util::ceil(num_values as i64, 8) as usize;
let bit_packed_max_size = num_runs + num_runs * bytes_per_run as usize;
let min_rle_run_size = 1 + bit_util::ceil(bit_width as i64, 8) as usize;
let rle_max_size =
    bit_util::ceil(num_values as i64, 8) as usize * min_rle_run_size;
std::cmp::max(bit_packed_max_size, rle_max_size) as usize
```
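To make the size of the estimate concrete, the two definitions quoted above can be transcribed into a standalone sketch. The helper `ceil` and the function `estimated_size` are hypothetical names introduced here for illustration; `MAX_VALUES_PER_BIT_PACKED_RUN = 512` follows the figure given in this issue, and `MAX_VLQ_BYTE_LEN = 10` is an assumed value for the `bit_util` constant.

```rust
// Assumed constants: 512 is the run length stated in this issue,
// 10 is an assumed value for bit_util::MAX_VLQ_BYTE_LEN.
const MAX_VALUES_PER_BIT_PACKED_RUN: usize = 512;
const MAX_VLQ_BYTE_LEN: usize = 10;

/// Integer ceiling division, standing in for `bit_util::ceil`.
fn ceil(value: usize, divisor: usize) -> usize {
    (value + divisor - 1) / divisor
}

/// Transcription of the quoted `min_buffer_size` body.
fn min_buffer_size(bit_width: u8) -> usize {
    let max_bit_packed_run_size =
        1 + ceil(MAX_VALUES_PER_BIT_PACKED_RUN * bit_width as usize, 8);
    let max_rle_run_size = MAX_VLQ_BYTE_LEN + ceil(bit_width as usize, 8);
    max_bit_packed_run_size.max(max_rle_run_size)
}

/// Transcription of the quoted `max_buffer_size` body.
fn max_buffer_size(bit_width: u8, num_values: usize) -> usize {
    let bytes_per_run = bit_width as usize;
    let num_runs = ceil(num_values, 8);
    let bit_packed_max_size = num_runs + num_runs * bytes_per_run;
    let min_rle_run_size = 1 + ceil(bit_width as usize, 8);
    let rle_max_size = num_runs * min_rle_run_size;
    bit_packed_max_size.max(rle_max_size)
}

/// Hypothetical helper: the combined estimate described at the top of this issue.
fn estimated_size(bit_width: u8, num_values: usize) -> usize {
    min_buffer_size(bit_width) + max_buffer_size(bit_width, num_values)
}
```

Under these assumptions, `estimated_size(4, 1024)` comes to 897 bytes (257 + 640), whereas 1024 four-bit values occupy 512 bytes before any run headers are counted.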
**To Reproduce**
It is unclear why `min_buffer_size` is included in the size estimate at all.
The definition of `max_buffer_size` is also overly pessimistic: it assumes
a maximum bit-packed run length of 8 values, when the actual maximum is
currently 512.
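The effect of the run-length assumption on header overhead can be sketched as follows. The function `bit_packed_size` is hypothetical and not part of `RleEncoder`; it assumes, as `min_buffer_size` does, one header byte per bit-packed run, and it is not a proven worst-case bound, since an encoder may emit shorter runs or interleave RLE runs.

```rust
/// Integer ceiling division.
fn ceil(value: usize, divisor: usize) -> usize {
    (value + divisor - 1) / divisor
}

/// Hypothetical helper: bytes needed to bit-pack `num_values` values when
/// every run holds up to `values_per_run` values (a multiple of 8).
/// Assumes one header byte per run.
fn bit_packed_size(bit_width: u8, num_values: usize, values_per_run: usize) -> usize {
    // Values are packed in groups of 8, each occupying `bit_width` bytes.
    let groups = ceil(num_values, 8);
    let payload = groups * bit_width as usize;
    // One header byte per run.
    let headers = ceil(num_values, values_per_run);
    headers + payload
}
```

With `values_per_run = 8` this reproduces the `bit_packed_max_size` term above: `bit_packed_size(1, 4096, 8)` is 1024 bytes. With 512-value runs, `bit_packed_size(1, 4096, 512)` is 520 bytes, so at small bit widths the per-run header assumption roughly doubles the estimate.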
**Expected behavior**
A more accurate size estimate for written RLE-encoded data.
**Additional context**