Re: [PR] GH-17682: [Format] Add Bool8 Canonical Extension Type [arrow]

via GitHub Thu, 18 Jul 2024 10:56:19 -0700


felipecrv commented on code in PR #43234:
URL: https://github.com/apache/arrow/pull/43234#discussion_r1683268705



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,28 @@ UUID
    A specific UUID version is not required or guaranteed. This extension 
represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not 
interpret the bytes in any way.
 
+8-bit Boolean
+====
+
+Bool8 represents a boolean value using 1 byte (8 bits) to store each value 
instead of only 1 bit as in
+the native Arrow Boolean type. Although less compact that the native 
representation, Bool8 may have
+better zero-copy compatibility with various systems that also store booleans 
using 1 byte.
+
+* Extension name: ``arrow.bool8``.
+
+* The storage type of this extension is ``Int8`` where:
+
+  * **false** is denoted by the value ``0``.
+  * **true** can be specified using any non-zero value.

Review Comment:
   > Doesn't this fall into the C++ semantics we described earlier? For a bool 
array comparison, this will cast each value to bool to perform the comparisons 
which will, as per C++ semantics, do precisely what you mentioned earlier to 
perform the normalization as needed. 
   
   Not really. `clang++` compiles `lhs == rhs` to this assembly [1]:
   
   ```asm
   Equals(bool, bool):                            # @Equals(bool, bool)
           mov     eax, edi
           xor     eax, esi
           xor     al, 1
           ret
   ```
   
   Which is equivalent to `(lhs ^ rhs) ^ 1`. So if you call `Equals` [2] like 
this:
   
   ```cpp
   bool Equals(bool lhs, bool rhs) {
       return lhs == rhs;
   }
   
   void VectorEquals(int n, const bool *lhs, const bool* rhs, bool *out) {
       for (int i = 0; i < n; i++) {
           // UBSAN builds will have instrumentation here to check that
           // the byte from memory being passed to a register as a "bool"
           // is in the valid range. Release builds don't care.
           out[i] = Equals(lhs[i], rhs[i]);
       }
   }
   
   void load_and_compare(int n) {
       // This cast can lead to UB if bytes are out of the [0, 1] range.
       auto *lhs = reinterpret_cast<const bool*>(LoadInt8Buffer(n));
       auto *rhs = reinterpret_cast<const bool*>(LoadInt8Buffer(n));
       auto *out = AllocBools(n);
       VectorEquals(n, lhs, rhs, out);
       PrintBools(n, out);
   }
   ```
   
   So let's say we load the `lhs` register with `2` (denoting `true`) and the 
`rhs` with `0` (denoting `false`). `Equals` of a value denoting `true` and a 
value denoting `false` should return a value denoting `false`, but:
   
   ```
   (lhs ^ rhs) ^ 1 =
   (10b ^ 00b) ^ 01b =
      10b      ^ 01b =
           11b  (3 evaluates to true!!)
   ```
   
   [1] https://godbolt.org/z/3sM4debdT
   [2] if libcudf is anything like Arrow code, that is how they would call the 
comparison function in a kernel loop



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-17682: [Format] Add Bool8 Canonical Extension Type [arrow]

Reply via email to