[GitHub] [arrow] westonpace commented on a diff in pull request #35565: GH-35498: [C++] Relax EnsureAlignment check in Acero from requiring 64-byte aligned buffers to requiring value-aligned buffers

via GitHub Thu, 18 May 2023 10:10:18 -0700


westonpace commented on code in PR #35565:
URL: https://github.com/apache/arrow/pull/35565#discussion_r1198071963



##########
cpp/src/arrow/util/align_util.cc:
##########
@@ -30,12 +32,120 @@ bool CheckAlignment(const Buffer& buffer, int64_t alignment) {
   return buffer.address() % alignment == 0;
 }
 
-bool CheckAlignment(const ArrayData& array, int64_t alignment) {
-  for (const auto& buffer : array.buffers) {
-    if (buffer) {
-      if (!CheckAlignment(*buffer, alignment)) return false;
+namespace {
+
+// Some buffers are frequently type-punned.  For example, in an int32 array the
+// values buffer is frequently cast to int32_t*
+//
+// This sort of punning is only valid if the pointer is aligned to a proper width
+// (e.g. 4 bytes in the case of int32).
+//
+// We generally assume that all buffers are at least 8-bit aligned and so we only
+// need to worry about buffers that are commonly cast to wider data types.  Note that
+// this alignment is something that is guaranteed by malloc (e.g. new int32_t[] will
+// return a buffer that is 4 byte aligned) or common libraries (e.g. numpy) but it is
+// not currently guaranteed by flight (GH-32276).
+//
+// By happy coincidence, for every data type, the only buffer that might need wider
+// alignment is the second buffer (at index 1).  This function returns the expected
+// alignment (in bits) of the second buffer for the given array to safely allow this cast.
+//
+// If the array's type doesn't have a second buffer or the second buffer is not expected
+// to be type punned, then we return 8.
+int GetMallocValuesAlignment(const ArrayData& array) {
+  // Make sure to use the storage type id
+  auto type_id = array.type->storage_id();
+  if (type_id == Type::DICTIONARY) {
+    // The values buffer is in a different ArrayData and so we only check the indices
+    // buffer here.  The values array data will be checked by the calling method.
+    type_id = ::arrow::internal::checked_pointer_cast<DictionaryType>(array.type)
+                  ->index_type()
+                  ->id();
+  }
+  switch (type_id) {
+    case Type::NA:                 // No buffers
+    case Type::FIXED_SIZE_LIST:    // No second buffer (values in child array)
+    case Type::FIXED_SIZE_BINARY:  // Fixed size binary could be dangerous but the
+                                   // compute kernels don't type pun this.  E.g. if
+                                   // an extension type is storing some kind of struct
+                                   // here then the user should do their own alignment
+                                   // check before casting to an array of structs
+    case Type::BOOL:               // Always treated as uint8_t*
+    case Type::INT8:               // Always treated as uint8_t*
+    case Type::UINT8:              // Always treated as uint8_t*
+    case Type::DECIMAL128:         // Always treated as uint8_t*
+    case Type::DECIMAL256:         // Always treated as uint8_t*

Review Comment:
   Yes, but I don't think there is anywhere in the code where we cast the bytes to `Decimal128*`.  That said, searching through the code now, it seems we do sometimes do this in the TPCH Node:
   
   ```
   const Decimal128* l_tax = reinterpret_cast<const Decimal128*>(
       tld.lineitem[ibatch][LINEITEM::L_TAX].array()->buffers[1]->data());
   ```
   
   This doesn't justify alignment on its own, but it's probably a good sign that we would do something like this were we to invest more in performant decimal kernels.  So maybe we should change the alignment requirement for decimals to 8 bytes, which is what I think we'd need to support `WordArray`:
   
   ```
   template <typename Derived, int BIT_WIDTH, int NWORDS = BIT_WIDTH / 64>
   class ARROW_EXPORT GenericBasicDecimal {
   public:
     using WordArray = std::array<uint64_t, NWORDS>;
    protected:
     WordArray array_;
   };
   ```
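
   To make the concern concrete, here is a minimal sketch of checking alignment before a cast like the one in the TPCH snippet. It does not use Arrow at all: `FakeDecimal128`, `IsAligned`, and `CastIfAligned` are hypothetical names invented for illustration, and the struct only mimics the `WordArray` layout shown above. The point is that the check should use `alignof` of the target type rather than assuming byte alignment is enough.

   ```cpp
   #include <array>
   #include <cstddef>
   #include <cstdint>

   // Hypothetical stand-in for a decimal type backed by 64-bit words
   // (mirrors the WordArray idea; this is not the Arrow class).
   struct FakeDecimal128 {
     std::array<std::uint64_t, 2> words;
   };

   // True if the address of `ptr` is a multiple of `alignment` bytes.
   // `alignment` must be non-zero.
   inline bool IsAligned(const void* ptr, std::size_t alignment) {
     return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
   }

   // Returns a typed view of `data`, or nullptr when the pointer is not
   // suitably aligned for T and the cast would be unsafe.
   template <typename T>
   const T* CastIfAligned(const std::uint8_t* data) {
     if (!IsAligned(data, alignof(T))) return nullptr;
     return reinterpret_cast<const T*>(data);
   }
   ```

   With something like this, a caller holding a buffer that starts one byte into an allocation (for example a slice) would get `nullptr` from `CastIfAligned<FakeDecimal128>` instead of a misaligned pointer.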
   
   


