Re: [PR] GH-17682: [Format] Add Bool8 Canonical Extension Type [arrow]

via GitHub Wed, 17 Jul 2024 17:28:49 -0700


felipecrv commented on code in PR #43234:
URL: https://github.com/apache/arrow/pull/43234#discussion_r1681923324



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,28 @@ UUID
    A specific UUID version is not required or guaranteed. This extension 
represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not 
interpret the bytes in any way.
 
+8-bit Boolean
+====
+
+Bool8 represents a boolean value using 1 byte (8 bits) to store each value 
instead of only 1 bit as in
+the native Arrow Boolean type. Although less compact that the native 
representation, Bool8 may have
+better zero-copy compatibility with various systems that also store booleans 
using 1 byte.
+
+* Extension name: ``arrow.bool8``.
+
+* The storage type of this extension is ``Int8`` where:
+
+  * **false** is denoted by the value ``0``.
+  * **true** can be specified using any non-zero value.

Review Comment:
   >  I don't think we gain anything specifically by requiring producers to 
provide values in [0, 1].
   
   @zeroshade We gain **the ability to say who is responsible for 
normalization** when there is a conflict in a bidirectional relationship (two 
systems that are both consumer and producer at the same time). This is what C++ 
had to do: int->bool normalizes while bool->int takes whatever is on the byte. 
Being loose on both directions is not the globally efficient solution because 
now both sides have to be paranoid instead of just one.
   
   > While low-level C performance can be gained by multiplying with true and 
false for branch free programming, in the context of a compute kernel I think 
we shouldn't rely on that behavior. Most compute systems disallow multiplying 
bools anyways. 
   
   Multiplying was just an example. Think about hashing/fingerprinting a 
`Bool8` array as another example where assuming normalized values is 
beneficial. An even more realistic example is comparing `Bool8` arrays for 
equality: if I implement `==(Bool8, Bool8)` as a simple fallback to `==(Int8, 
Int8)` (a very reasonable implementation) and it fails because arrays are not 
normalized I would ask the producers of the `Bool8` arrays to be more 
conservative in what they produce.
   
   I believe this is how libcudf implements bool array comparison. This only 
works if producers of bool arrays stick to producing values in the `[0, 1]` 
range.
   
   
https://github.com/rapidsai/cudf/blob/34dea6fe40fc20966b48257853865111df4a687f/cpp/include/cudf/table/experimental/row_operators.cuh#L1293-L1297
   
   At least one array producer in the codebase is producing arrays with values 
in the `[0, 1]` range — 
https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/transform/mask_to_bools.cu#L55.
 The `bool` returned becomes `0` or `1` per CUDA/C++ conventions.
   
   > Allowing zero-copy casts between Bool8 and Int8 without requiring a pass 
to check values and convert them is a good benefit I think.
   
   A pass is not necessary because more often than not, the process that is 
producing 8-bit booleans is already producing 0s and 1s.
   
   > Allowing zero-copy casts between Bool8 and Int8 without requiring a pass 
to check values and convert them is a good benefit I think.
   
   All we need is the producer to have the equivalent of 
`reinterpret_cast<const bool*>(buffer)` to convert an `int8` array into a 
`Bool8` array without a pass that checks the values. This `reinterpret_cast` 
doesn't check byte by byte because the user is supposed to guarantee by 
construction of the buffer that the values on the `buffer` are in [0,1].
   
   > @felipecrv I think we're talking about mostly the same semantics but with 
slightly different phrasing.
   
   I think I'm being precious about a very nuanced topic (semantics + undefined 
behavior) and I'm not even able to convince others that this matters in the 
first place ;)
   
   > 1. I did some 
[investigation](https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b)
 into how numpy handles this in the context of zero-copy sharing with pyarrow. 
It appears numpy does in fact canonicalize boolean values to 0 and 1, but 
understands any nonzero value to be true without forcing a copy. This aligns 
well with the behavior we're discussing.
   
   This aligns with what I'm proposing. Normalize when you're producing and 
take non-zero to mean true.
   
   > 2. libcudf defines its [BOOL8 
type](https://docs.rapids.ai/api/libcudf/stable/group__utility__types#ggadf077607da617d1dadcc5417e2783539a05afd9eb8887a406d47474cd3809a5dd)
 as "Boolean using one byte per value, 0 == false, else true". It may be true 
that CUDF will often or even alway use the values 0 or 1 (I don't actually 
know), but it's not consistent with the documented behavior.
   
   The libcudf spec doesn't contradicts anything I proposed, but it 
under-specifies the semantics (IMO) by not saying that, when a choice has to be 
made, `1` and not `0xff` or `8` is used to represent `true`. And if they did 
that (produced non-0 values non-deterministically to represent `true`) a lot 
would break within cudf itself (see the link to the `==` comparison above).
   
   You don't have to say `MUST` in the spec. Just say that `1` is the value of 
choice when you're forced to pick a non-zero value to represent `true`.
   
   > In practice, this will almost always be 0 and 1, but I'm generally in 
favor of fewer restrictions on an interoperable protocol where possible given 
that our stated goal is zero-copy.
   
   I'm also going for flexibility here. I'm a fan of undefined-behavior and 
non-determinism as programming tools, but to maximize their effectiveness you 
have to say which side is responsible for making things deterministic when 
there is a conflict. While I say producers must pick `1` for true, I'm also 
hedging by saying that consumers should try to be robust against producers 
picking something other than `1`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-17682: [Format] Add Bool8 Canonical Extension Type [arrow]

Reply via email to