[GitHub] [arrow] pitrou commented on a diff in pull request #35628: GH-35627: [C++][Format][Integration] Add string view to the arrow format

via GitHub Thu, 24 Aug 2023 07:09:54 -0700


pitrou commented on code in PR #35628:
URL: https://github.com/apache/arrow/pull/35628#discussion_r1304322845



##########
cpp/src/parquet/statistics.cc:
##########
@@ -437,6 +437,10 @@ std::pair<ByteArray, ByteArray> GetMinMaxBinaryHelper(
   if (::arrow::is_binary_like(values.type_id())) {
     ::arrow::VisitArraySpanInline<::arrow::BinaryType>(
         *values.data(), std::move(valid_func), std::move(null_func));
+  } else if (values.type_id() == ::arrow::Type::BINARY_VIEW ||
+             values.type_id() == ::arrow::Type::STRING_VIEW) {
+    ::arrow::VisitArraySpanInline<::arrow::BinaryViewType>(

Review Comment:
   Perhaps define a `is_binary_view_like`?



##########
format/Message.fbs:
##########
@@ -99,6 +99,12 @@ table RecordBatch {
 
   /// Optional compression of the message body
   compression: BodyCompression;
+
+  /// Some types such as Utf8View are represented using a variable number of 
buffers.
+  /// For each such Field in the pre-ordered flattened logical schema, there 
will be
+  /// an entry in variadicCounts to indicate the number of extra buffers which 
belong
+  /// to that Field.
+  variadicCounts: [long];

Review Comment:
   The name is too vague, can we call it something like `extraBufferCounts`?



##########
cpp/src/arrow/util/string_header.h:
##########
@@ -0,0 +1,377 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+/*
+ * Copyright (c) Facebook, Inc. and its affiliates.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <array>
+#include <cassert>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <string>
+#include <string_view>
+
+namespace arrow {
+
+/// Variable length string or binary with 4 byte prefix and inline optimization
+/// for small values (12 bytes or fewer). This is similar to std::string_view
+/// except that the referenced is limited in size to UINT32_MAX and up to the
+/// first four bytes of the string are copied into the struct. The prefix 
allows
+/// failing comparisons early and can reduce the CPU cache working set when
+/// dealing with short strings.
+///
+/// This structure supports three states:
+///
+/// Short string   |----|----|--------|
+///                 ^    ^      ^
+///                 |    |      |
+///                 size prefix remaining in-line portion, zero padded
+///
+/// Long string    |----|----|--------|
+///                 ^    ^      ^
+///                 |    |      |
+///                 size prefix raw pointer to out-of-line portion
+///
+/// IO Long string |----|----|----|----|
+///                 ^    ^      ^     ^
+///                 |    |      |     `----------.
+///                 size prefix buffer index and offset to out-of-line portion
+///
+/// Adapted from TU Munich's UmbraDB [1], Velox, DuckDB.
+///
+/// [1]: https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
+///
+/// There is no way to determine from a non-inline StringHeader whether it 
refers
+/// to its out-of-line portion with a raw pointer or with index/offset. This
+/// information is stored at the column level; so a buffer of StringHeader will
+/// contain only one or the other. In general unless a StringHeader is resident
+/// in a StringView array's buffer it will refer to out-of-line data with a raw
+/// pointer. This default is assumed by several members of StringHeader such as
+/// operator==() and operator string_view() since these and other operations 
cannot
+/// be performed on index/offset StringHeaders without also accessing the 
buffers
+/// storing their out-of-line data. Which states pertain to each accessor and
+/// constructor are listed in their comments.
+struct alignas(8) StringHeader {

Review Comment:
   Can we please call this `StringViewHeader`? This is specific to the 
binary-view-like types.
   



##########
cpp/src/arrow/util/string_header.h:
##########
@@ -0,0 +1,377 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+/*
+ * Copyright (c) Facebook, Inc. and its affiliates.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <array>
+#include <cassert>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <string>
+#include <string_view>
+
+namespace arrow {
+
+/// Variable length string or binary with 4 byte prefix and inline optimization
+/// for small values (12 bytes or fewer). This is similar to std::string_view
+/// except that the referenced is limited in size to UINT32_MAX and up to the
+/// first four bytes of the string are copied into the struct. The prefix 
allows
+/// failing comparisons early and can reduce the CPU cache working set when
+/// dealing with short strings.
+///
+/// This structure supports three states:
+///
+/// Short string   |----|----|--------|
+///                 ^    ^      ^
+///                 |    |      |
+///                 size prefix remaining in-line portion, zero padded
+///
+/// Long string    |----|----|--------|
+///                 ^    ^      ^
+///                 |    |      |
+///                 size prefix raw pointer to out-of-line portion

Review Comment:
   Ping @bkietz . Either raw pointers are a separate PR, or at least they 
should not use the same payload type (`StringViewHeader` vs. something else).
   
   Note that using raw pointers is a property of the array's DataType, not a 
per-value property inside the array.



##########
docs/source/format/Columnar.rst:
##########
@@ -350,6 +352,51 @@ will be represented as follows: ::
     |----------------|----------------------|
     | joemark        | unspecified          |
 
+Variable-size Binary View Layout
+--------------------------------
+
+Each value in this layout consists of 0 or more bytes. These characters'
+locations are indicated using a **views** buffer, which may point to one
+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:

Review Comment:
   Not really. Each packed struct's field should have its endianness handled 
independently.
   For example, short strings would be:
   * in little-endian mode:
     * bytes 0-3 (little-endian int32)
     * data: bytes 4-15 (no endianness)
   * in big-endian mode:
     * bytes 0-3 (big-endian int32)
     * data: bytes 4-15 (no endianness)
   



##########
docs/source/format/Columnar.rst:
##########
@@ -350,6 +352,51 @@ will be represented as follows: ::
     |----------------|----------------------|
     | joemark        | unspecified          |
 
+Variable-size Binary View Layout
+--------------------------------
+
+Each value in this layout consists of 0 or more bytes. These characters'
+locations are indicated using a **views** buffer, which may point to one
+of potentially several **data** buffers or may contain the characters
+inline.
+
+The views buffer contains `length` view structures with the following layout:
+
+::
+
+    * Short strings, length <= 12
+      | Bytes 0-3  | Bytes 4-15                            |
+      |------------|---------------------------------------|
+      | length     | data (padded with 0)                  |
+
+    * Long strings, length > 12
+      | Bytes 0-3  | Bytes 4-7  | Bytes 8-11 | Bytes 12-15 |
+      |------------|------------|------------|-------------|
+      | length     | prefix     | buf. index | offset      |
+
+In both the long and short string cases, the first four bytes encode the
+length of the string and can be used to determine how the rest of the view
+should be interpreted.
+
+In the short string case the string's bytes are inlined- stored inside the
+view itself, in the twelve bytes which follow the length.
+
+In the long string case, a buffer index indicates which character buffer
+stores the characters and an offset indicates where in that buffer the
+characters begin. Buffer index 0 refers to the first character buffer, IE
+the first buffer **after** the validity buffer and the views buffer.
+The half-open range ``[offset, offset + length)`` must be entirely contained
+within the indicated buffer. A copy of the first four bytes of the string is
+stored inline in the prefix, after the length. This prefix enables a
+profitable fast path for string comparisons, which are frequently determined
+within the first four bytes.
+

Review Comment:
   Ping @bkietz . This doesn't seem to match the current C++ binary view tests, 
which state that:
   > Invalid string views which are masked by a null bit do not cause 
validation to fail
   



##########
cpp/src/arrow/compute/kernel.cc:
##########
@@ -249,9 +249,35 @@ class LargeBinaryLikeMatcher : public TypeMatcher {
   std::string ToString() const override { return "large-binary-like"; }
 };
 
+std::shared_ptr<TypeMatcher> LargeBinaryLike() {
+  return std::make_shared<LargeBinaryLikeMatcher>();
+}
+
+class BinaryViewLikeMatcher : public TypeMatcher {

Review Comment:
   Ping on this.



##########
cpp/src/arrow/array/array_base.cc:
##########
@@ -87,6 +87,10 @@ struct ScalarFromArraySlotImpl {
     return Finish(a.GetString(index_));
   }
 
+  Status Visit(const BinaryViewArray& a) {
+    return Finish(std::string{a.GetView(index_)});

Review Comment:
   Why not expose a `GetString` for consistency?



##########
cpp/src/arrow/util/string_conversion_benchmark.cc:
##########
@@ -0,0 +1,252 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/unreachable.h"

Review Comment:
   Let's add these benchmarks in another PR? Limiting a PR's size makes it more 
easily reviewable.



##########
cpp/src/arrow/util/string_header.cc:
##########
@@ -0,0 +1,29 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <iostream>
+
+#include "arrow/util/string_header.h"
+
+namespace arrow {
+
+std::ostream& operator<<(std::ostream& os, const StringHeader& header) {
+  os.write(header.data(), header.size());

Review Comment:
   This is entirely broken for IPC-compatible StringViews, right? (`data()` 
using the embedded raw pointer)



##########
format/Schema.fbs:
##########
@@ -171,6 +172,17 @@ table LargeUtf8 {
 table LargeBinary {
 }
 
+/// Same as Utf8, but string characters are delimited with a packed
+/// length/pointer instead of offsets.

Review Comment:
   Hmm, there will certainly not be a pointer in the IPC format, so this should 
be rephrased.
   
   Perhaps:
   ```
   Logically the same as Utf8, but the internal representation uses a packed 
struct
   that can contain either the entire string (for small strings) or an inlined 
prefix
   adorned with an index and offset pointing to a slice of another buffer (for 
non-small strings).
   ```
   



##########
docs/source/format/Columnar.rst:
##########
@@ -350,6 +352,51 @@ will be represented as follows: ::
     |----------------|----------------------|
     | joemark        | unspecified          |
 
+Variable-size Binary View Layout
+--------------------------------
+

Review Comment:
   Can we mention that this appeared in format version 1.4?
   For example, using Sphinx's [dedicated 
markup](https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#directive-versionadded):
   ```suggestion
   .. versionadded:: Arrow Columnar Format 1.4
   
   ```



##########
dev/archery/archery/integration/datagen.py:
##########
@@ -743,6 +763,72 @@ class LargeStringColumn(_BaseStringColumn, 
_LargeOffsetsMixin):
     pass
 
 
+class BinaryViewColumn(PrimitiveColumn):
+
+    def _encode_value(self, x):
+        return frombytes(binascii.hexlify(x).upper())
+
+    def _get_buffers(self):
+        char_buffers = []
+        DEFAULT_BUFFER_SIZE = 32  # ¯\_(ツ)_/¯
+        INLINE_SIZE = 12
+
+        data = []
+        for i, v in enumerate(self.values):
+            if not self.is_valid[i]:
+                v = b''
+            assert isinstance(v, bytes)
+
+            if len(v) > INLINE_SIZE:
+                offset = 0
+                if len(v) > DEFAULT_BUFFER_SIZE:
+                    char_buffers.append(v)
+                else:
+                    if len(char_buffers) == 0:
+                        char_buffers.append(v)
+                    elif len(char_buffers[-1]) + len(v) > DEFAULT_BUFFER_SIZE:
+                        char_buffers.append(v)
+                    else:
+                        offset = len(char_buffers[-1])
+                        char_buffers[-1] += v
+                        assert len(char_buffers[-1]) <= DEFAULT_BUFFER_SIZE
+
+                buffer_index = len(char_buffers) - 1
+
+                # the prefix is always 4 bytes so it may not be utf-8
+                # even if the whole string view is
+                prefix = v[:4].ljust(4, b'\0')
+                prefix = frombytes(binascii.hexlify(prefix).upper())
+
+                data.append(OrderedDict([
+                    ('SIZE', len(v)),
+                    ('PREFIX', prefix),

Review Comment:
   Call this "PREFIX_HEX" for clarity?



##########
dev/archery/archery/integration/datagen.py:
##########
@@ -743,6 +763,72 @@ class LargeStringColumn(_BaseStringColumn, 
_LargeOffsetsMixin):
     pass
 
 
+class BinaryViewColumn(PrimitiveColumn):
+
+    def _encode_value(self, x):
+        return frombytes(binascii.hexlify(x).upper())
+
+    def _get_buffers(self):
+        char_buffers = []
+        DEFAULT_BUFFER_SIZE = 32  # ¯\_(ツ)_/¯

Review Comment:
   I suppose this small value is meant to easily produce multiple buffers? Add 
a non-obscure comment?



##########
docs/source/format/Columnar.rst:
##########
@@ -885,6 +932,7 @@ of memory buffers for each layout.
 
    "Primitive",validity,data,
    "Variable Binary",validity,offsets,data
+   "Variable Binary - View",validity,offsets,*multiple_data_buffers

Review Comment:
   I'm not sure this renders nicely and/or is easy to read. Perhaps add a new 
column "Extra Buffers" to the table?



##########
docs/source/format/Columnar.rst:
##########
@@ -885,6 +932,7 @@ of memory buffers for each layout.
 
    "Primitive",validity,data,
    "Variable Binary",validity,offsets,data
+   "Variable Binary - View",validity,offsets,*multiple_data_buffers

Review Comment:
   Also, "offsets" is not exact here. Perhaps "packed headers" instead?



##########
dev/archery/archery/integration/datagen.py:
##########
@@ -743,6 +763,72 @@ class LargeStringColumn(_BaseStringColumn, 
_LargeOffsetsMixin):
     pass
 
 
+class BinaryViewColumn(PrimitiveColumn):
+
+    def _encode_value(self, x):
+        return frombytes(binascii.hexlify(x).upper())
+
+    def _get_buffers(self):
+        char_buffers = []
+        DEFAULT_BUFFER_SIZE = 32  # ¯\_(ツ)_/¯
+        INLINE_SIZE = 12
+
+        data = []
+        for i, v in enumerate(self.values):
+            if not self.is_valid[i]:
+                v = b''
+            assert isinstance(v, bytes)
+
+            if len(v) > INLINE_SIZE:
+                offset = 0
+                if len(v) > DEFAULT_BUFFER_SIZE:
+                    char_buffers.append(v)
+                else:
+                    if len(char_buffers) == 0:

Review Comment:
   Can be an `elif` chain all the way down.
   ```suggestion
                   elif len(char_buffers) == 0:
   ```



##########
cpp/src/arrow/util/string_header.h:
##########
@@ -0,0 +1,377 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+/*
+ * Copyright (c) Facebook, Inc. and its affiliates.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#pragma once
+
+#include <array>
+#include <cassert>
+#include <cstdint>
+#include <cstring>
+#include <iosfwd>
+#include <string>
+#include <string_view>
+
+namespace arrow {
+
+/// Variable length string or binary with 4 byte prefix and inline optimization
+/// for small values (12 bytes or fewer). This is similar to std::string_view
+/// except that the referenced is limited in size to UINT32_MAX and up to the
+/// first four bytes of the string are copied into the struct. The prefix 
allows
+/// failing comparisons early and can reduce the CPU cache working set when
+/// dealing with short strings.
+///
+/// This structure supports three states:
+///
+/// Short string   |----|----|--------|
+///                 ^    ^      ^
+///                 |    |      |
+///                 size prefix remaining in-line portion, zero padded
+///
+/// Long string    |----|----|--------|
+///                 ^    ^      ^
+///                 |    |      |
+///                 size prefix raw pointer to out-of-line portion
+///
+/// IO Long string |----|----|----|----|
+///                 ^    ^      ^     ^
+///                 |    |      |     `----------.
+///                 size prefix buffer index and offset to out-of-line portion
+///
+/// Adapted from TU Munich's UmbraDB [1], Velox, DuckDB.
+///
+/// [1]: https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
+///
+/// There is no way to determine from a non-inline StringHeader whether it 
refers
+/// to its out-of-line portion with a raw pointer or with index/offset. This
+/// information is stored at the column level; so a buffer of StringHeader will
+/// contain only one or the other. In general unless a StringHeader is resident
+/// in a StringView array's buffer it will refer to out-of-line data with a raw
+/// pointer. This default is assumed by several members of StringHeader such as
+/// operator==() and operator string_view() since these and other operations 
cannot
+/// be performed on index/offset StringHeaders without also accessing the 
buffers
+/// storing their out-of-line data. Which states pertain to each accessor and
+/// constructor are listed in their comments.
+struct alignas(8) StringHeader {
+ public:
+  using value_type = char;
+
+  static constexpr size_t kTotalSize = 16;
+  static constexpr size_t kSizeSize = sizeof(uint32_t);
+  static constexpr size_t kIndexOffsetSize = sizeof(uint32_t) * 2;
+  static constexpr size_t kPrefixSize = kTotalSize - kSizeSize - 
kIndexOffsetSize;
+  static_assert(kPrefixSize == 4);
+  static constexpr size_t kInlineSize = kTotalSize - kPrefixSize;
+  static_assert(kInlineSize == 12);
+
+  /// Construct an empty view.
+  StringHeader() = default;
+
+  /// Construct a RAW POINTER view.
+  StringHeader(const char* data, size_t len) : 
size_(static_cast<uint32_t>(len)) {
+    if (size_ == 0) return;
+
+    // TODO(bkietz) better option than assert?
+    assert(data);
+    if (IsInline()) {
+      // small string: inlined. Bytes beyond size_ are already 0
+      memcpy(prefix_.data(), data, size_);
+    } else {
+      // large string: store pointer
+      memcpy(prefix_.data(), data, kPrefixSize);
+      value_.data = data;
+    }
+  }
+
+  /// Construct a RAW POINTER view.
+  StringHeader(const uint8_t* data, int64_t len)
+      : StringHeader(reinterpret_cast<const char*>(data), 
static_cast<size_t>(len)) {}
+
+  /// Convenience implicit constructor for RAW POINTER views from C 
string/string literal.
+  ///
+  /// NOLINTNEXTLINE runtime/explicit
+  StringHeader(const char* data) : StringHeader(data, std::strlen(data)) {}
+
+  /// Construct a RAW POINTER view.
+  explicit StringHeader(const std::string& value)
+      : StringHeader(value.data(), value.size()) {}
+
+  /// Construct a RAW POINTER view.
+  explicit StringHeader(std::string_view value)
+      : StringHeader(value.data(), value.size()) {}
+
+  /// Construct an INDEX/OFFSET view.
+  StringHeader(const char* data, uint32_t len, uint32_t buffer_index,
+               const char* buffer_data)
+      : size_(len) {
+    if (size_ == 0) return;
+
+    // TODO(bkietz) better option than assert?
+    assert(data);
+    if (IsInline()) {
+      // small string: inlined. Bytes beyond size_ are already 0
+      memcpy(prefix_.data(), data, size_);
+    } else {
+      // large string: store index/offset
+      memcpy(prefix_.data(), data, kPrefixSize);
+      SetIndexOffset(buffer_index, static_cast<uint32_t>(data - buffer_data));
+    }
+  }
+
+  /// Construct an INDEX/OFFSET view.
+  StringHeader(uint32_t len, std::array<char, kPrefixSize> prefix, uint32_t 
buffer_index,
+               uint32_t offset)
+      : size_(len), prefix_(prefix) {
+    SetIndexOffset(buffer_index, offset);
+  }
+
+  template <typename I>
+  static constexpr bool IsInline(I size) {
+    return size <= static_cast<I>(kInlineSize);
+  }
+  static constexpr bool IsInline(uint32_t size) { return size <= kInlineSize; }
+
+  /// True if the view's data is entirely stored inline.
+  /// This function is safe for use against both RAW POINTER and INDEX/OFFSET 
views.
+  bool IsInline() const { return IsInline(size_); }
+
+  /// Return a RAW POINTER view's data.
+  const char* data() const& { return IsInline() ? prefix_.data() : 
value_.data; }
+  const char* data() && = delete;
+
+  /// The number of characters viewed by this StringHeader.
+  /// This function is safe for use against both RAW POINTER and INDEX/OFFSET 
views.
+  size_t size() const { return size_; }
+
+  /// Print a RAW POINTER view to a std::ostream.
+  friend std::ostream& operator<<(std::ostream& os, const StringHeader& 
header);
+
+  /// Equality comparison between RAW POINTER views.
+  bool operator==(const StringHeader& other) const {
+    // Compare lengths and first 4 characters.
+    if (SizeAndPrefixAsInt64() != other.SizeAndPrefixAsInt64()) {
+      return false;
+    }
+    if (IsInline()) {
+      // The inline part is zeroed at construction, so we can compare
+      // a word at a time if data extends past 'prefix_'.
+      return size_ <= kPrefixSize || InlinedAsInt64() == 
other.InlinedAsInt64();
+    }
+    // Sizes are equal and this is not inline, therefore both are out
+    // of line and have kPrefixSize first in common.
+    return memcmp(value_.data + kPrefixSize, other.value_.data + kPrefixSize,
+                  size_ - kPrefixSize) == 0;
+  }
+
+  /// Inequality comparison between RAW POINTER views.
+  bool operator!=(const StringHeader& other) const { return !(*this == other); 
}
+
+  /// Less-than comparison between RAW POINTER views.
+  bool operator<(const StringHeader& other) const { return Compare(other) < 0; 
}
+
+  /// Less-than-or-equal comparison between RAW POINTER views.
+  bool operator<=(const StringHeader& other) const { return Compare(other) <= 
0; }
+
+  /// Greater-than comparison between RAW POINTER views.
+  bool operator>(const StringHeader& other) const { return Compare(other) > 0; 
}
+
+  /// Greater-than-or-equal comparison between RAW POINTER views.
+  bool operator>=(const StringHeader& other) const { return Compare(other) >= 
0; }
+
+  /// Conversion to std::string_view for RAW POINTER views.
+  explicit operator std::string_view() const& { return {data(), size()}; }
+  explicit operator std::string_view() && = delete;
+
+  /// Return the always-inline cached first 4 bytes of this StringHeader.
+  /// This function is safe for use against both RAW POINTER and INDEX/OFFSET 
views.
+  std::array<char, kPrefixSize> GetPrefix() const { return prefix_; }
+
+  /// Return an INDEX/OFFSET view's buffer index.
+  uint32_t GetBufferIndex() const { return value_.io_data.buffer_index; }
+
+  /// Return an INDEX/OFFSET view's buffer offset.
+  uint32_t GetBufferOffset() const { return value_.io_data.offset; }
+
+  /// Return a RAW POINTER view's data pointer.
+  ///
+  /// NOT VALID FOR INLINE VIEWS.
+  const char* GetRawPointer() const { return value_.data; }
+
+  /// Return an INDEX/OFFSET view's data pointer.
+  ///
+  /// NOT VALID FOR INLINE VIEWS.
+  template <typename BufferPtr>
+  const char* GetPointerFromBuffers(const BufferPtr* char_buffers) const {
+    return char_buffers[GetBufferIndex()]->template data_as<char>() + 
GetBufferOffset();
+  }
+
+  /// Return an INDEX/OFFSET view's data pointer.
+  template <typename BufferPtr>
+  const char* GetPointerFromBuffersOrInlineData(const BufferPtr* char_buffers) 
const {
+    return IsInline() ? GetInlineData() : GetPointerFromBuffers(char_buffers);
+  }
+
+  /// Return a the inline data of a view.
+  ///
+  /// For inline views, this points to the entire data of the view.
+  /// For other views, this points to the 4 byte prefix.
+  const char* GetInlineData() const& { return prefix_.data(); }
+  const char* GetInlineData() && = delete;
+
+  /// Mutate into a RAW POINTER view.
+  ///
+  /// This function is only intended for use in converting from an equivalent 
INDEX/OFFSET
+  /// view; in particular it does not check or modify the prefix for 
consistency with the
+  /// new data pointer.
+  void SetRawPointer(const char* data) { value_.data = data; }
+
+  /// Mutate into an INDEX/OFFSET view.
+  ///
+  /// This function is only intended for use in converting from an equivalent 
RAW POINTER
+  /// view; in particular it does not check or modify the prefix for 
consistency with the
+  /// new buffer index/offset.
+  void SetIndexOffset(uint32_t buffer_index, uint32_t offset) {
+    value_.io_data = {buffer_index, offset};
+  }
+
+  /// Equality compare an INDEX/OFFSET view in place.
+  ///
+  /// Equivalent comparison will be accomplished by (for example) first 
converting both
+  /// views to std::string_view and comparing those, but this would not take 
advantage
+  /// of the cached 4 byte prefix.
+  template <typename BufferPtr>
+  bool EqualsIndexOffset(const BufferPtr* char_buffers, const StringHeader& 
other,
+                         const BufferPtr* other_char_buffers) const {
+    if (SizeAndPrefixAsInt64() != other.SizeAndPrefixAsInt64()) {
+      return false;
+    }
+    if (IsInline()) {
+      return InlinedAsInt64() == other.InlinedAsInt64();
+    }
+    // Sizes are equal and this is not inline, therefore both are out of line 
and we
+    // have already checked that their kPrefixSize first characters are equal.
+    return memcmp(GetPointerFromBuffers(char_buffers) + kPrefixSize,
+                  other.GetPointerFromBuffers(other_char_buffers) + 
kPrefixSize,
+                  size() - kPrefixSize) == 0;
+  }
+
+  /// Less-than compare an INDEX/OFFSET view in place.
+  ///
+  /// Equivalent comparison will be accomplished by (for example) first 
converting both
+  /// views to std::string_view and comparing those, but this would not take 
advantage
+  /// of the cached 4 byte prefix.
+  template <typename BufferPtr>
+  bool LessThanIndexOffset(const BufferPtr* char_buffers, const StringHeader& 
other,
+                           const BufferPtr* other_char_buffers) const {
+    return CompareIndexOffset(char_buffers, other, other_char_buffers) < 0;
+  }
+
+ private:
+  // Returns 0, if this == other
+  //       < 0, if this < other
+  //       > 0, if this > other
+  int32_t Compare(const StringHeader& other) const {
+    if (PrefixAsInt() != other.PrefixAsInt()) {
+      // The result is decided on prefix. The shorter will be less
+      // because the prefix is padded with zeros.
+      return memcmp(prefix_.data(), other.prefix_.data(), kPrefixSize);
+    }
+    int32_t size = std::min(size_, other.size_) - kPrefixSize;
+    if (size <= 0) {
+      // One string is just the prefix.
+      return size_ - other.size_;
+    }
+    if (static_cast<uint32_t>(size) <= kInlineSize && IsInline() && 
other.IsInline()) {
+      int32_t result = memcmp(value_.inlined.data(), 
other.value_.inlined.data(), size);
+      return (result != 0) ? result : size_ - other.size_;
+    }
+    int32_t result = memcmp(data() + kPrefixSize, other.data() + kPrefixSize, 
size);
+    return (result != 0) ? result : size_ - other.size_;
+  }
+
+  template <typename BufferPtr>
+  int CompareIndexOffset(const BufferPtr* char_buffers, const StringHeader& 
other,
+                         const BufferPtr* other_char_buffers) const {
+    if (PrefixAsInt() != other.PrefixAsInt()) {
+      // The result is decided on prefix. The shorter will be less
+      // because the prefix is padded with zeros.
+      return memcmp(prefix_.data(), other.prefix_.data(), kPrefixSize);
+    }
+    int32_t size = std::min(size_, other.size_) - kPrefixSize;
+    if (size <= 0) {
+      // One string is just the prefix.
+      return size_ - other.size_;
+    }
+    if (static_cast<uint32_t>(size) <= kInlineSize && IsInline() && 
other.IsInline()) {
+      int32_t result = memcmp(value_.inlined.data(), 
other.value_.inlined.data(), size);
+      return (result != 0) ? result : size_ - other.size_;
+    }
+
+    int32_t result = memcmp(
+        GetPointerFromBuffersOrInlineData(char_buffers) + kPrefixSize,
+        other.GetPointerFromBuffersOrInlineData(other_char_buffers) + 
kPrefixSize, size);
+    return (result != 0) ? result : size_ - other.size_;
+  }
+
+  int64_t SizeAndPrefixAsInt64() const {
+    return reinterpret_cast<const int64_t*>(this)[0];
+  }
+
+  int64_t InlinedAsInt64() const { return reinterpret_cast<const 
int64_t*>(this)[1]; }
+
+  int32_t PrefixAsInt() const { return *reinterpret_cast<const 
int32_t*>(&prefix_); }
+
+  // FIXME(bkietz) replace this with a std::array<uint8_t, 16> and forgo the 
union.
+  // Type punning (AKA violation of the strict aliasing rule) is undefined 
behavior.
+  // Using memcpy to access the bytes of the object representation of 
trivially copyable
+  // objects is not undefined behavior. Given sufficiently explicit hints on 
alignment
+  // and size, compilers elide memcpy calls in favor of identical assembly to 
what
+  // the type punning implementation produces.
+  // We rely on all members being laid out top to bottom . C++
+  // guarantees this.
+  uint32_t size_ = 0;
+  std::array<char, kPrefixSize> prefix_ = {0};
+  union {
+    std::array<char, 8> inlined = {0};
+    const char* data;
+    struct {
+      uint32_t buffer_index;
+      uint32_t offset;

Review Comment:
   Ping on this. I think we should not deviate arbitrarily.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] pitrou commented on a diff in pull request #35628: GH-35627: [C++][Format][Integration] Add string view to the arrow format

Reply via email to