felipecrv commented on code in PR #43234:
URL: https://github.com/apache/arrow/pull/43234#discussion_r1681923324
##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,28 @@ UUID
A specific UUID version is not required or guaranteed. This extension
represents
UUIDs as FixedSizeBinary(16) with big-endian notation and does not
interpret the bytes in any way.
+8-bit Boolean
+====
+
+Bool8 represents a boolean value using 1 byte (8 bits) to store each value
instead of only 1 bit as in
+the native Arrow Boolean type. Although less compact that the native
representation, Bool8 may have
+better zero-copy compatibility with various systems that also store booleans
using 1 byte.
+
+* Extension name: ``arrow.bool8``.
+
+* The storage type of this extension is ``Int8`` where:
+
+ * **false** is denoted by the value ``0``.
+ * **true** can be specified using any non-zero value.
Review Comment:
> I don't think we gain anything specifically by requiring producers to
provide values in [0, 1].
@zeroshade We gain **the ability to say who is responsible for
normalization** when there is a conflict in a bidirectional relationship (two
systems that are both consumer and producer at the same time). This is what C++
had to do: int->bool normalizes while bool->int takes whatever is on the byte.
Being loose on both directions is not the globally efficient solution because
now both sides have to be paranoid instead of just one.
> While low-level C performance can be gained by multiplying with true and
false for branch free programming, in the context of a compute kernel I think
we shouldn't rely on that behavior. Most compute systems disallow multiplying
bools anyways.
Multiplying was just an example. Think about hashing/fingerprinting a
`Bool8` array as another example where assuming normalized values is
beneficial. An even more realistic example is comparing `Bool8` arrays for
equality: if I implement `==(Bool8, Bool8)` as a simple fallback to `==(Int8,
Int8)` (a very reasonable implementation) and it fails because arrays are not
normalized I would ask the producers of the `Bool8` arrays to be more
conservative in what they produce.
I believe this is how libcudf implements bool array comparison. This only
works if producers of bool arrays stick to producing values in the `[0, 1]`
range.
https://github.com/rapidsai/cudf/blob/34dea6fe40fc20966b48257853865111df4a687f/cpp/include/cudf/table/experimental/row_operators.cuh#L1293-L1297
At least one array producer in the codebase is producing arrays with values
in the `[0, 1]` range —
https://github.com/rapidsai/cudf/blob/branch-24.08/cpp/src/transform/mask_to_bools.cu#L55.
The `bool` returned becomes `0` or `1` per CUDA/C++ conventions.
> Allowing zero-copy casts between Bool8 and Int8 without requiring a pass
to check values and convert them is a good benefit I think.
A pass is not necessary because more often than not, the process that is
producing 8-bit booleans is already producing 0s and 1s.
> Allowing zero-copy casts between Bool8 and Int8 without requiring a pass
to check values and convert them is a good benefit I think.
All we need is the producer to have the equivalent of
`reinterpret_cast<const bool*>(buffer)` to convert an `int8` array into a
`Bool8` array without a pass that checks the values. This `reinterpret_cast`
doesn't check byte by byte because the user is supposed to guarantee by
construction of the buffer that the values on the `buffer` are in [0,1].
> @felipecrv I think we're talking about mostly the same semantics but with
slightly different phrasing.
I think I'm being precious about a very nuanced topic (semantics + undefined
behavior) and I'm not even able to convince others that this matters in the
first place ;)
> 1. I did some
[investigation](https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b)
into how numpy handles this in the context of zero-copy sharing with pyarrow.
It appears numpy does in fact canonicalize boolean values to 0 and 1, but
understands any nonzero value to be true without forcing a copy. This aligns
well with the behavior we're discussing.
This aligns with what I'm proposing. Normalize when you're producing and
take non-zero to mean true.
> 2. libcudf defines its [BOOL8
type](https://docs.rapids.ai/api/libcudf/stable/group__utility__types#ggadf077607da617d1dadcc5417e2783539a05afd9eb8887a406d47474cd3809a5dd)
as "Boolean using one byte per value, 0 == false, else true". It may be true
that CUDF will often or even alway use the values 0 or 1 (I don't actually
know), but it's not consistent with the documented behavior.
The libcudf spec doesn't contradicts anything I proposed, but it
under-specifies the semantics (IMO) by not saying that, when a choice has to be
made, `1` and not `0xff` or `8` is used to represent `true`. And if they did
that (produced non-0 values non-deterministically to represent `true`) a lot
would break within cudf itself (see the link to the `==` comparison above).
You don't have to say `MUST` in the spec. Just say that `1` is the value of
choice when you're forced to pick a non-zero value to represent `true`.
> In practice, this will almost always be 0 and 1, but I'm generally in
favor of fewer restrictions on an interoperable protocol where possible given
that our stated goal is zero-copy.
I'm also going for flexibility here. I'm a fan of undefined-behavior and
non-determinism as programming tools, but to maximize their effectiveness you
have to say which side is responsible for making things deterministic when
there is a conflict. While I say producers must pick `1` for true, I'm also
hedging by saying that consumers should try to be robust against producers
picking something other than `1`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]