Re: [PR] GH-37876: [Format] Add list-view specification to arrow format [arrow]

via GitHub Mon, 02 Oct 2023 07:40:44 -0700


felipecrv commented on code in PR #37877:
URL: https://github.com/apache/arrow/pull/37877#discussion_r1342783110



##########
docs/source/format/Columnar.rst:
##########
@@ -487,6 +499,102 @@ will be represented as follows: ::
           |-------------------------------|-----------------------|
           | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |
 
+ListView Layout
+~~~~~~~~~~~~~~~
+
+The ListView layout is defined by three buffers: a validity bitmap, an offsets
+buffer, and an additional sizes buffer. Sizes and offsets have the identical 
bit
+width and both 32-bit and 64-bit signed integer options are supported.
+
+As in the List layout, the offsets encode the start position of each slot in 
the
+child array. In contrast to the List layout, list lengths are stored explicitly
+in the sizes buffer instead of inferred. This allows offsets to be out of 
order.
+Elements of the child array do not have to be stored in the same order they
+logically appear in the list elements of the parent array.
+
+When a value is null, the corresponding offset and size can have arbitrary
+values. When size is 0, the corresponding offset can have an arbitrary value.
+If choosing a value is possible, we recommend setting offsets and sizes to 0 in

Review Comment:
   It's not so much about possibility (so I'm very open to re-wording), but 
more about machine affinity (having less branches in loops, SIMD...). I can't 
predict all the possible ways people might want to save time in compute kernels.
   
   For instance, when I slice the list-views for concatenation I have to 
subtract values from offsets (because I'm compressing the child array). I can 
do it in multiple ways:
   
    - `offsets[i] = offsets[i] - displacement;` leading to the possibility of 
negative offsets when `sizes[i] == 0` and `offsets[i]` is `unspecified` and 
potentially smaller than the positive displacement.
    - `offsets[i] = std::max(0, offsets[i] - displacement);` guaranteeing it's 
zero but introducing branching `*`
    - skipping setting `offsets[i]` at all when `sizes[i] == 0` introducing 
branching
    
   I don't think this is an important enough operation but it illustrates how 
easy these stricter requirements could get in the way of micro optimizations. I 
suspect these become even more relevant when attempts to vectorize these loops 
are made.
   
   `*` or requiring branchless saturating math operations



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-37876: [Format] Add list-view specification to arrow format [arrow]

Reply via email to