etseidl commented on code in PR #9653:
URL: https://github.com/apache/arrow-rs/pull/9653#discussion_r3045962251


##########
parquet/src/encodings/rle.rs:
##########
@@ -122,6 +122,27 @@ impl RleEncoder {
         bit_packed_max_size.max(rle_max_size)
     }
 
+    /// Returns `true` if the encoder is currently in RLE accumulation mode
+    /// for the given value (i.e., `repeat_count > 8` and `current_value == value`).
+    ///
+    /// When this returns `true`, callers may use [`extend_run`](Self::extend_run)
+    /// to add more repetitions without per-element overhead.
+    #[inline]
+    pub fn is_accumulating(&self, value: u64) -> bool {
+        self.repeat_count > 8 && self.current_value == value

Review Comment:
   > The RLE encoder transitions to accumulation mode **after** the 8th value 
has been buffered and `flush_buffered_values()` has committed the RLE decision.
   
   Here's my understanding: a repeated value is added with `put`. The 
`repeat_count` is incremented and reaches 8. This does not trigger the 
return branch, so execution continues. `num_buffered_values` is currently 7; 
the value is added to the `buffered_values` array, and `num_buffered_values` 
is incremented to 8. This triggers `flush_buffered_values()`. 
`flush_buffered_values()` sees that `repeat_count` is 8, so it simply sets 
`num_buffered_values` to 0, potentially ends a previous bit-packed run by 
writing the run length indicator, and returns. We then return from `put` with 
`repeat_count` still 8 and `num_buffered_values = 0`, and we're now in 
accumulating mode. If `is_accumulating()` is called after this `put()` (which 
seems to always be the case), I think `>= 8` is correct.
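
To make the trace above easier to check, here is a toy model of just the counters involved. It is a sketch, not the real `RleEncoder`: `ToyRle`, `accumulating_gt`, and `accumulating_ge` are names made up for illustration, the actual value buffer and all output writing are left out, and resetting `repeat_count` in the bit-packed branch is my assumption about what the real code does.

```rust
/// Toy model of the counters discussed above. NOT the real `RleEncoder`:
/// the value buffer and all output writing are omitted.
#[derive(Default)]
struct ToyRle {
    current_value: u64,
    repeat_count: usize,
    num_buffered_values: usize,
}

impl ToyRle {
    fn put(&mut self, value: u64) {
        if self.current_value == value {
            self.repeat_count += 1;
            if self.repeat_count > 8 {
                // Continuation of an established run: nothing to buffer.
                return;
            }
        } else {
            // A new value starts a new candidate run.
            self.repeat_count = 1;
            self.current_value = value;
        }
        // Stand-in for storing the value in `buffered_values`.
        self.num_buffered_values += 1;
        if self.num_buffered_values == 8 {
            self.flush_buffered_values();
        }
    }

    fn flush_buffered_values(&mut self) {
        self.num_buffered_values = 0;
        if self.repeat_count >= 8 {
            // RLE chosen; `repeat_count` is left untouched.
            return;
        }
        // Assumption: the bit-packed path resets the repeat counter.
        self.repeat_count = 0;
    }

    /// The check as written in the diff (`> 8`).
    fn accumulating_gt(&self, value: u64) -> bool {
        self.repeat_count > 8 && self.current_value == value
    }

    /// The check suggested in this comment (`>= 8`).
    fn accumulating_ge(&self, value: u64) -> bool {
        self.repeat_count >= 8 && self.current_value == value
    }
}

fn main() {
    let mut enc = ToyRle::default();
    for _ in 0..8 {
        enc.put(5);
    }
    // State right after the 8th identical `put`.
    println!(
        "repeat_count={} num_buffered_values={} gt={} ge={}",
        enc.repeat_count,
        enc.num_buffered_values,
        enc.accumulating_gt(5),
        enc.accumulating_ge(5)
    );
}
```

In this model the two checks disagree only in the state immediately after the 8th identical `put`; from the 9th onward they give the same answer.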


