alamb opened a new pull request #12019: URL: https://github.com/apache/arrow/pull/12019
# Rationale

The question of "what are the values of the offsets for non-valid entries in arrays" came up in arrow-rs: https://github.com/apache/arrow-rs/issues/1071 and the existing [docs](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout) seem to be somewhat vague on this issue.

I looked at three implementations of arrow, and they all seem to assume / validate that the offsets are monotonic:

* The C++ implementation (I think) also ensures the offsets are monotonic without first checking the validity array https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/validate.cc#L568-L592
* arrow-rs after https://github.com/apache/arrow-rs/pull/921 (based on the C++) will refuse to create arrays where the array offsets are non-monotonic
* arrow2 also ensures that offsets are always monotonic. https://github.com/jorgecarleitao/arrow2/blob/37a9c758826a92d98dc91e992b2a49ce9724095d/src/array/specification.rs#L102-L119

# Changes

Thus I propose updating the format docs to make the monotonic offsets explicit.

# Background

I think @jorgecarleitao's description on https://github.com/apache/arrow-rs/issues/1071#issuecomment-998481607 explains the reason why having monotonic offsets is a good idea:

> I think that in general the property we seek is: discarding the validity cannot result in UB when accessing the values. This justifies the values buffer of a primitive array is always initialized, and the offsets being valid and in-bounds even in null cases.
>
> The rational for this is that sometimes it is faster to skip validity accesses and only iterate over the values (and clone the validity). I do not recall the benchmark result, but this may explain why string comparison ignores validity and & the bitmaps instead.
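To make the property concrete, here is a minimal Python sketch of an offsets check that never consults the validity bitmap. It is not taken from any of the three implementations linked above. The names `validate_offsets` and `slot_lengths` are hypothetical, "monotonic" is interpreted here as non-decreasing, and treating an empty offsets buffer as an empty array is an assumption of this sketch.

```python
from typing import List, Sequence


def validate_offsets(offsets: Sequence[int], values_len: int) -> None:
    """Raise ValueError unless offsets are non-negative, non-decreasing,
    and within the bounds of the values buffer.

    The validity bitmap is deliberately not an input: null slots must
    satisfy the same rule as valid ones.
    """
    if len(offsets) == 0:
        # Assumption of this sketch: no offsets means an empty array.
        return
    if offsets[0] < 0:
        raise ValueError("first offset must be non-negative")
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            raise ValueError(
                f"offsets decrease at index {i}: "
                f"{offsets[i - 1]} -> {offsets[i]}"
            )
    if offsets[-1] > values_len:
        raise ValueError("last offset exceeds the values buffer length")


def slot_lengths(offsets: Sequence[int]) -> List[int]:
    """Length of every slot, computed without looking at validity."""
    return [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]


# Three slots over the values b"joemark"; the middle slot is null.
values = b"joemark"
offsets = [0, 3, 3, 7]
validity = [True, False, True]  # never consulted by the functions above

validate_offsets(offsets, len(values))
lengths = slot_lengths(offsets)  # every length is >= 0, null slot included
```

With offsets such as `[0, 3, 1, 7]` the check raises, even if the slot that starts at offset 3 is marked null. That is the point of the quoted comment: code that skips the validity bitmap and iterates over the values directly can never compute a negative length or read out of bounds.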