paleolimbot commented on code in PR #37526:
URL: https://github.com/apache/arrow/pull/37526#discussion_r1330655138


##########
docs/source/format/Columnar.rst:
##########
@@ -350,6 +352,53 @@ will be represented as follows: ::
     |----------------|-----------------------|
     | joemark        | unspecified (padding) |
 
+.. versionadded:: Arrow Columnar Format 1.4
+
+Variable-size Binary View Layout
+--------------------------------
+
+Each value in this layout consists of 0 or more bytes. These characters'
+locations are indicated using a **views** buffer, which may point to one
+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:
+
+::
+
+    * Short strings, length <= 12
+      | Bytes 0-3  | Bytes 4-15                            |
+      |------------|---------------------------------------|
+      | length     | data (padded with 0)                  |
+
+    * Long strings, length > 12
+      | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
+      |------------|------------|------------|-------------|
+      | length     | prefix     | buf. index | offset      |
+
+In both the long and short string cases, the first four bytes encode the
+length of the string and can be used to determine how the rest of the view
+should be interpreted.
+
+In the short string case the string's bytes are inlined- stored inside the
+view itself, in the twelve bytes which follow the length.
+
+In the long string case, a buffer index indicates which character buffer
+stores the characters and an offset indicates where in that buffer the
+characters begin. Buffer index 0 refers to the first character buffer, IE
+the first buffer **after** the validity buffer and the views buffer.
+The half-open range ``[offset, offset + length)`` must be entirely contained
+within the indicated buffer. A copy of the first four bytes of the string is
+stored inline in the prefix, after the length. This prefix enables a
+profitable fast path for string comparisons, which are frequently determined
+within the first four bytes.
+
+Views must be aligned to an 8-byte boundary. This restriction enables more
+efficient interoperation with systems where the index and offset are replaced
+by a raw pointer. All integers (length, buffer index, and offset) are unsigned

Review Comment:
   I agree that the standard seems to have adopted signed integers for sizes in 
every other case. It seems very strange to impose that only for new types?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to