pitrou commented on code in PR #37877:
URL: https://github.com/apache/arrow/pull/37877#discussion_r1342788133
##########
docs/source/format/Columnar.rst:
##########
@@ -487,6 +499,102 @@ will be represented as follows: ::
|-------------------------------|-----------------------|
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |
+ListView Layout
+~~~~~~~~~~~~~~~
+
+The ListView layout is defined by three buffers: a validity bitmap, an offsets
+buffer, and an additional sizes buffer. Sizes and offsets have the same bit
+width, and both 32-bit and 64-bit signed integer options are supported.
+
+As in the List layout, the offsets encode the start position of each slot in the
+child array. In contrast to the List layout, list lengths are stored explicitly
+in the sizes buffer instead of being inferred. This allows offsets to be out of
+order: elements of the child array do not have to be stored in the same order
+they logically appear in the list elements of the parent array.
+
+When a value is null, the corresponding offset and size can have arbitrary
+values. When the size is 0, the corresponding offset can have an arbitrary
+value.
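For illustration only (this is a sketch, not Arrow library code), the buffer relationship described in the proposed text can be expressed in Python, with plain lists standing in for the validity bitmap and the offsets, sizes, and child buffers; the example values are invented:

```python
# Sketch of logical element access in a ListView layout: each slot i is
# described by an explicit (offsets[i], sizes[i]) pair into the child array.

def listview_get(validity, offsets, sizes, child_values, i):
    """Return the logical value of slot i, or None if the slot is null."""
    if not validity[i]:          # null slot: offset and size are unspecified
        return None
    start = offsets[i]           # start position in the child array
    length = sizes[i]            # explicit length, not inferred from offsets
    return child_values[start:start + length]

# Logical array: [[12, -7, 25], None, [0, -127, 127, 50], []]
validity = [True, False, True, True]
offsets = [0, 7, 3, 0]           # out-of-order offsets are permitted
sizes = [3, 0, 4, 0]
child = [12, -7, 25, 0, -127, 127, 50]

print([listview_get(validity, offsets, sizes, child, i) for i in range(4)])
# → [[12, -7, 25], None, [0, -127, 127, 50], []]
```

Note how slot 1 (null) and slot 3 (empty) carry offsets that are never dereferenced, which is exactly the leniency under discussion below.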
Review Comment:
> I consider that being more strict on what consumers of list-views can do
> than saying "go ahead and feel free to dereference any offset". Being
> stricter on producers of list-views doesn't make the consumers protected
> from malicious or bogus producers.
That's a good point. But it also means that, instead of delegating handling
of "wild" offsets to a validation routine, consumers have to be careful to
accept such offsets at any place where ListView data is processed. This
requires more care from implementations, with no upside that I can think of.
> Being able to pass random list-view arrays with wild offsets[i] when
> sizes[i] == 0 is how we can fuzz the implementations and get ASan / UBSan
> to catch the mistakes.
We already fuzz Arrow C++ with invalid data using the IPC layer. The way
it's done is precisely to call the validation routine after having read the
data (if reading the data was at all successful).
Conversely, we cannot possibly fuzz all internal routines (compute
functions, for example) against invalid or non-conventional data. We should
indeed try to generate interesting (but valid) cases in our random generation
functions, but not all internal routines are tested against random data.
So it seems to me that being stricter at the edge helps safety more than
making the format more lenient so that we can better fuzz the internals.
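To make the "stricter at the edge" position concrete, here is a hedged Python sketch (the function name and signature are invented for illustration, not an Arrow API) of a single bounds-checking routine applied once when data enters the system, after which internal routines may assume in-bounds offsets:

```python
# Sketch of edge validation for a ListView layout: one pass over the
# offsets/sizes buffers at the boundary, so downstream code never has to
# defend against "wild" offsets itself.

def validate_listview(validity, offsets, sizes, child_length):
    """Return True if every non-null slot references an in-bounds range."""
    for i in range(len(offsets)):
        if not validity[i]:
            continue                     # null slots carry no constraints
        if offsets[i] < 0 or sizes[i] < 0:
            return False
        if offsets[i] + sizes[i] > child_length:
            return False                 # out-of-bounds range caught here, once
    return True

assert validate_listview([True, True], [0, 3], [3, 4], 7)
assert not validate_listview([True], [5], [3], 7)    # 5 + 3 > 7
```

As written, this checks offsets unconditionally for non-null slots; under the more lenient rule being debated, the check would have to be relaxed when `sizes[i] == 0`.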
@alamb @tustvold From the perspective of the Arrow Rust implementation, what
is your opinion?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]