alamb opened a new pull request #12019:
URL: https://github.com/apache/arrow/pull/12019


   # Rationale
   The question "what are the values of the offsets for non-valid (null) 
entries in arrays?" came up in arrow-rs 
(https://github.com/apache/arrow-rs/issues/1071), and the existing 
[docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout)
 are vague on this point.
   
   I looked at three implementations of Arrow, and they all seem to assume / 
validate that the offsets are monotonic:
   * The C++ implementation (I think) ensures the offsets are monotonic 
without first checking the validity array: 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L568-L592
   * arrow-rs, after https://github.com/apache/arrow-rs/pull/921 (based on the 
C++ implementation), refuses to create arrays whose offsets are non-monotonic.
   * arrow2 also ensures that offsets are always monotonic: 
https://github.com/jorgecarleitao/arrow2/blob/37a9c758826a92d98dc91e992b2a49ce9724095d/src/array/specification.rs#L102-L119
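
To make the check concrete, here is a minimal Python sketch of the kind of validation described above. It is not taken from any of the three implementations: `validate_offsets` is a hypothetical helper, "monotonic" is assumed to mean non-decreasing, and the requirement of at least one offset entry is an assumption of this sketch. The point is that the validity bitmap is never consulted, so null slots must still carry sensible offsets.

```python
from typing import Sequence


def validate_offsets(offsets: Sequence[int], values_len: int) -> None:
    """Raise ValueError unless offsets are non-decreasing and in bounds.

    Hypothetical helper: the validity bitmap is deliberately not an
    argument, so offsets of null slots are checked like any other.
    """
    if len(offsets) == 0:
        # Assumption of this sketch: an offsets buffer has length + 1 entries.
        raise ValueError("offsets must hold at least one entry")
    if offsets[0] < 0:
        raise ValueError("first offset must be non-negative")
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            raise ValueError(
                f"offset {offsets[i]} at position {i} is smaller than "
                f"the previous offset {offsets[i - 1]}"
            )
    if offsets[-1] > values_len:
        raise ValueError("last offset points past the end of the values buffer")


# ["ab", null, "cde"]: the null slot simply repeats the previous offset.
validate_offsets([0, 2, 2, 5], values_len=5)
```

With this rule, an offsets buffer such as `[0, 2, 0, 3]` is rejected even if the slot at index 1 is marked null.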
   
   # Changes
   Thus I propose updating the format docs to make the monotonic-offsets 
requirement explicit. 
   
   # Background
   I think @jorgecarleitao's description in 
https://github.com/apache/arrow-rs/issues/1071#issuecomment-998481607 explains 
why having monotonic offsets is a good idea:
   
   > I think that in general the property we seek is: discarding the validity 
cannot result in UB when accessing the values. This justifies the values buffer 
of a primitive array is always initialized, and the offsets being valid and 
in-bounds even in null cases.
   >
   > The rationale for this is that sometimes it is faster to skip validity 
accesses and only iterate over the values (and clone the validity). I do not 
recall the benchmark result, but this may explain why string comparison ignores 
validity and ANDs the bitmaps instead.
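
As a rough illustration of the pattern in the quote, the sketch below compares two string arrays slot by slot without looking at validity, then combines the validity bitmaps separately. The function names and the list-of-bools bitmap are hypothetical simplifications, not any library's API. Python slicing cannot trigger undefined behavior, so this only shows the shape of such a kernel: in Rust or C++ the unchecked slice is what relies on offsets being monotonic and in bounds for null slots too.

```python
from typing import List, Sequence


def eq_values(
    left_values: bytes, left_offsets: Sequence[int],
    right_values: bytes, right_offsets: Sequence[int],
) -> List[bool]:
    """Compare every slot, null or not, using only offsets and values."""
    n = len(left_offsets) - 1
    return [
        left_values[left_offsets[i]:left_offsets[i + 1]]
        == right_values[right_offsets[i]:right_offsets[i + 1]]
        for i in range(n)
    ]


def and_validity(left: Sequence[bool], right: Sequence[bool]) -> List[bool]:
    """A result slot is valid only if both input slots are valid."""
    return [a and b for a, b in zip(left, right)]


# left = ["ab", null, "cde"], right = ["ab", "x", null]
result_values = eq_values(b"abcde", [0, 2, 2, 5], b"abx", [0, 2, 3, 3])
result_validity = and_validity([True, False, True], [True, True, False])
```

The entries of `result_values` at null positions are meaningless but harmless, because `result_validity` masks them out.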
   

