tustvold commented on code in PR #35628:
URL: https://github.com/apache/arrow/pull/35628#discussion_r1278528048


##########
docs/source/format/Columnar.rst:
##########
@@ -350,6 +352,51 @@ will be represented as follows: ::
     |----------------|----------------------|
     | joemark        | unspecified          |
 
+Variable-size Binary View Layout
+--------------------------------
+
+Each value in this layout consists of 0 or more bytes. These characters'
+locations are indicated using a **views** buffer, which may point to one
+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:
+
+::
+
+    * Short strings, length <= 12
+      | Bytes 0-3  | Bytes 4-15                            |
+      |------------|---------------------------------------|
+      | length     | data (padded with 0)                  |
+
+    * Long strings, length > 12
+      | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
+      |------------|------------|------------|-------------|
+      | length     | prefix     | buf. index | offset      |
+
+In both the long and short string cases, the first four bytes encode the
+length of the string and can be used to determine how the rest of the view
+should be interpreted.
+
+In the short string case the string's bytes are inlined- stored inside the
+view itself, in the twelve bytes which follow the length.
+
+In the long string case, a buffer index indicates which character buffer
+stores the characters and an offset indicates where in that buffer the
+characters begin. Buffer index 0 refers to the first character buffer, IE
+the first buffer **after** the validity buffer and the views buffer.
+The half-open range ``[offset, offset + length)`` must be entirely contained
+within the indicated buffer. A copy of the first four bytes of the string is
+stored inline in the prefix, after the length. This prefix enables a
+profitable fast path for string comparisons, which are frequently determined
+within the first four bytes.
+

Review Comment:
   ```suggestion
   
   All views must be well defined, even for null slots. In particular, if the 
length is greater than 12, the prefix, buffer index, and offset must refer to 
valid data.
   
   ```
   
   This property is very important for the Rust implementation: it lets us 
provide safe value access without needing to inspect the null mask. That in 
turn allows more sophisticated strategies for handling and iterating over the 
null mask.
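
   To illustrate why this matters, here is a minimal sketch in Python (not the 
Rust implementation). It assumes the four-byte fields are little-endian signed 
32-bit integers, which the text above does not state; `make_view`, `read_view` 
and `is_well_defined` are hypothetical helper names. If every view passes 
`is_well_defined`, `read_view` can be called on any slot, null or not, without 
looking at the validity buffer.

```python
import struct

INLINE_MAX = 12  # views with length <= 12 carry their bytes inline


def make_view(value, buf_index=0, offset=0):
    """Build a 16-byte view (assumed little-endian) for illustration."""
    if len(value) <= INLINE_MAX:
        # "12s" pads the inline data with zero bytes
        return struct.pack("<i12s", len(value), value)
    return struct.pack("<i4sii", len(value), value[:4], buf_index, offset)


def is_well_defined(view, data_buffers):
    """Check that a view can be dereferenced safely, regardless of validity."""
    if len(view) != 16:
        return False
    (length,) = struct.unpack_from("<i", view, 0)
    if length < 0:
        return False
    if length <= INLINE_MAX:
        return True
    prefix, buf_index, offset = struct.unpack_from("<4sii", view, 4)
    if not 0 <= buf_index < len(data_buffers):
        return False
    buf = data_buffers[buf_index]
    if offset < 0 or offset + length > len(buf):
        return False
    # the inline prefix must agree with the referenced data
    return buf[offset:offset + 4] == prefix


def read_view(view, data_buffers):
    """Read a value without consulting the null mask.

    Only safe when the view is well defined (see is_well_defined).
    """
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _prefix, buf_index, offset = struct.unpack_from("<4sii", view, 4)
    return data_buffers[buf_index][offset:offset + length]
```

   Under these assumptions an all-zero view is well defined (length 0, inline), 
so a writer could fill null slots with zeroed views, whereas a null slot 
holding arbitrary bytes could point outside every data buffer.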



##########
docs/source/format/Columnar.rst:
##########
@@ -350,6 +352,51 @@ will be represented as follows: ::
     |----------------|----------------------|
     | joemark        | unspecified          |
 
+Variable-size Binary View Layout
+--------------------------------
+
+Each value in this layout consists of 0 or more bytes. These characters'
+locations are indicated using a **views** buffer, which may point to one
+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:

Review Comment:
   The endianness of this data structure wasn't immediately apparent to me. I 
interpreted the view as a single 128-bit integer with the native endianness. 
I believe this is consistent with intervals.
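
   A small Python sketch of the ambiguity, not a statement of what the spec 
intends: the same 16 bytes are packed as four little-endian 32-bit fields (an 
assumption), then read back as one 128-bit integer under each byte order. The 
helper name `fields_from_u128` and the field values are made up for 
illustration.

```python
import struct

# Hypothetical long-string view, packed as four little-endian 32-bit fields:
# length=20, prefix=b"abcd", buffer index=1, offset=64
view = struct.pack("<i4sii", 20, b"abcd", 1, 64)


def fields_from_u128(raw, byteorder):
    """Read 16 bytes as one 128-bit integer, then split it into four
    32-bit lanes, lowest lane first."""
    word = int.from_bytes(raw, byteorder)
    return [(word >> (32 * lane)) & 0xFFFFFFFF for lane in range(4)]


little = fields_from_u128(view, "little")
big = fields_from_u128(view, "big")

# Little-endian: lane 0 is the length, lane 2 the buffer index, lane 3 the offset.
# Big-endian: the lanes come out reversed and each one is byte-swapped.
```

   Reading the view as a 128-bit little-endian integer recovers the fields in 
table order, while a big-endian read does not, so the doc should probably say 
which byte order applies.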


