felipecrv commented on code in PR #34896:
URL: https://github.com/apache/arrow/pull/34896#discussion_r1164438328


##########
cpp/src/arrow/util/ree_util.h:
##########
@@ -231,31 +253,52 @@ class RunEndEncodedArraySpan {
   ///
   /// \param logical_pos is an index in the [0, length()] range
   Iterator iterator(int64_t logical_pos) const {
-    return iterator(logical_pos, logical_pos < length()
-                                     ? PhysicalIndex(logical_pos)
-                                     : RunEndsArray(array_span).length);
+    if (logical_pos < length()) {
+      return iterator(logical_pos, PhysicalIndex(logical_pos));
+    }
+    // If logical_pos is above the valid range, use length() as the logical
+    // position and calculate the physical address right after the last valid
+    // physical position. Which is the physical index of the last logical
+    // position, plus 1.
+    return (length() == 0) ? iterator(0, PhysicalIndex(0))
+                           : iterator(length(), PhysicalIndex(length() - 1) + 
1);
   }
 
   /// \brief Create an iterator representing the logical begin of the run-end
   /// encoded array
-  Iterator begin() const { return iterator(0); }
+  Iterator begin() const { return iterator(0, PhysicalIndex(0)); }
 
   /// \brief Create an iterator representing the first invalid logical position
   /// of the run-end encoded array
   ///
-  /// The Iterator returned by end() should not be
+  /// \warning Avoid calling end() in a loop, as it will recompute the physical
+  /// length of the array on each call (O(log N) cost per call).
+  ///
+  /// \par You can write your loops like this instead:
+  /// \code
+  /// for (auto it = array.begin(), end = array.end(); it != end; ++it) {
+  ///   // ...
+  /// }
+  /// \endcode
+  ///
+  /// \par Or this version that does not look like idiomatic C++, but removes
+  /// the need for calling end() completely:
+  /// \code
+  /// for (auto it = array.begin(); !it.is_end(array); ++it) {
+  ///   // ...
+  /// }
+  /// \endcode

Review Comment:
   Things got tricky because I want to be able to land on a valid value after 
applying `--` to the Iterator returned by `array.end()`. Introducing a sentinel 
value would add a check for every `--` and give the compiler a hard time 
optimizing loops. What I had before this PR kinda worked (taking the physical 
length), but didn't survive the introduction of `operator--`.
   
   
   `for (auto it = array.begin(), end = array.end(); it != end; ++it)` is what 
performance-conscious C++ developers already use to write loops and the only 
extra cost incurred is a single O(log n) search. So I don't think adding more 
fields to the iterator and complicating the loop conditions is a good idea.
   
   `for (auto it = array.begin(); !it.is_end(array); ++it)` is weird, but makes 
the perfectionists (me) very happy about eliminating that extra O(log n) search 
completely.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to