[GitHub] [arrow] bkietz commented on a diff in pull request #37526: GH-35627: [Format][Integration] Add string-view to arrow format

via GitHub Tue, 12 Sep 2023 07:34:38 -0700


bkietz commented on code in PR #37526:
URL: https://github.com/apache/arrow/pull/37526#discussion_r1323141953



##########
dev/archery/archery/integration/datagen.py:
##########
@@ -743,6 +763,82 @@ class LargeStringColumn(_BaseStringColumn, _LargeOffsetsMixin):
     pass
 
 
+class BinaryViewColumn(PrimitiveColumn):
+
+    def _encode_value(self, x):
+        return frombytes(binascii.hexlify(x).upper())
+
+    def _get_buffers(self):
+        char_buffers = []
+        # a small default char buffer size is used so we get multiple
+        # character buffers without massive arrays
+        DEFAULT_BUFFER_SIZE = 32
+        INLINE_SIZE = 12
+
+        data = []
+        for i, v in enumerate(self.values):
+            if not self.is_valid[i]:
+                v = b''
+            assert isinstance(v, bytes)
+
+            if len(v) > INLINE_SIZE:
+                offset = 0
+                if len(v) > DEFAULT_BUFFER_SIZE:
+                    # This string doesn't fit into a default sized char buffer;
+                    # add it whole as a self-contained character buffer.
+                    char_buffers.append(v)
+                elif len(char_buffers) == 0:
+                    # No character buffers have been added yet;
+                    # add this string whole (we may append to it later).
+                    char_buffers.append(v)
+                elif len(char_buffers[-1]) + len(v) > DEFAULT_BUFFER_SIZE:
+                    # Appending this string to the current active char buffer
+                    # would overflow the default buffer size; add it whole.
+                    char_buffers.append(v)
+                else:
+                    # Append this string to the current active char buffer.
+                    offset = len(char_buffers[-1])
+                    char_buffers[-1] += v
+                    # Sanity check that we haven't produced a char buffer
+                    # longer than the default:
+                    assert len(char_buffers[-1]) <= DEFAULT_BUFFER_SIZE
+
+                buffer_index = len(char_buffers) - 1
+
+                # the prefix is always 4 bytes so it may not be utf-8
+                # even if the whole string view is
+                prefix = v[:4].ljust(4, b'\0')
+                prefix = frombytes(binascii.hexlify(prefix).upper())
+
+                data.append(OrderedDict([
+                    ('SIZE', len(v)),
+                    ('PREFIX_HEX', prefix),
+                    ('BUFFER_INDEX', buffer_index),
+                    ('OFFSET', offset),
+                ]))

Review Comment:
   Oh, I see. This is the encoding of the *views* rather than of the variadic
buffers. The views buffer is an array of structs in IPC, so I copied that
layout here. I'll rename the buffers
`DATA -> VIEWS, VARIADIC_BUFFERS -> VARIADIC_DATA_BUFFERS`,
which I think would have avoided this confusion.
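
   To make the distinction concrete, here is a minimal sketch (not the code in this PR) of one out-of-line view next to the variadic data buffer it points into. The helper names `json_view` and `pack_view` are hypothetical, `.decode('ascii')` stands in for the `frombytes` helper, and the 16-byte little-endian layout (size, 4-byte prefix, buffer index, offset) is my reading of the views struct, so treat it as an assumption. Values short enough to be inlined are not covered.

```python
import binascii
import struct
from collections import OrderedDict

INLINE_SIZE = 12  # same threshold as in the quoted diff


def json_view(v, buffer_index, offset):
    """Build the JSON struct for one out-of-line view (hypothetical helper)."""
    assert isinstance(v, bytes) and len(v) > INLINE_SIZE
    # The prefix is always 4 bytes, zero-padded, and hex-encoded because
    # it may not be valid UTF-8 on its own.
    prefix = v[:4].ljust(4, b'\0')
    return OrderedDict([
        ('SIZE', len(v)),
        ('PREFIX_HEX', binascii.hexlify(prefix).upper().decode('ascii')),
        ('BUFFER_INDEX', buffer_index),
        ('OFFSET', offset),
    ])


def pack_view(view):
    """Pack one view as a 16-byte struct (assumed little-endian layout)."""
    prefix = binascii.unhexlify(view['PREFIX_HEX'])
    return struct.pack('<i4sii', view['SIZE'], prefix,
                       view['BUFFER_INDEX'], view['OFFSET'])


# The character data lives in the variadic data buffers ...
variadic_data_buffers = [b'hello, string views!']
# ... while each view only records where to find it.
view = json_view(variadic_data_buffers[0], buffer_index=0, offset=0)
packed = pack_view(view)
```

   The point is that the `VIEWS` entries are fixed-size structs, while the bytes themselves sit in `VARIADIC_DATA_BUFFERS`.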



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
