joellubi commented on code in PR #43234:
URL: https://github.com/apache/arrow/pull/43234#discussion_r1681657179


##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,28 @@ UUID
    A specific UUID version is not required or guaranteed. This extension represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.
 
+8-bit Boolean
+====
+
+Bool8 represents a boolean value using 1 byte (8 bits) to store each value instead of only 1 bit as in
+the native Arrow Boolean type. Although less compact that the native representation, Bool8 may have
+better zero-copy compatibility with various systems that also store booleans using 1 byte.
+
+* Extension name: ``arrow.bool8``.
+
+* The storage type of this extension is ``Int8`` where:
+
+  * **false** is denoted by the value ``0``.
+  * **true** can be specified using any non-zero value.

Review Comment:
   @felipecrv I think we're talking about mostly the same semantics but with 
slightly different phrasing.
   
   The distinction between
   
   **A**: "producers SHOULD produce 0 or 1 values"
   
   and
   
   **B**: "producers MUST produce 0 or 1 values" + affordances for "less 
strictly-conformant producers"
   
   is very subtle. IMO the first statement is a simpler and clearer description 
of the specification.
   
   I'll add 2 data points to the discussion to make things more concrete:
   1. I did some [investigation](https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b) into how numpy handles this in the context of zero-copy sharing with pyarrow. It appears that numpy does canonicalize boolean values to 0 and 1, but it treats any nonzero value as true without forcing a copy. This aligns well with the behavior we're discussing.
   2. libcudf defines its [BOOL8 type](https://docs.rapids.ai/api/libcudf/stable/group__utility__types#ggadf077607da617d1dadcc5417e2783539a05afd9eb8887a406d47474cd3809a5dd) as "Boolean using one byte per value, 0 == false, else true". CUDF may often or even always use the values 0 or 1 (I don't actually know), but the documented behavior does not limit it to those values.
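
To make the semantics under discussion concrete, here is a minimal standard-library sketch of statement **A**: a consumer treats any non-zero Int8 value as true, while a producer that wants canonical output rewrites non-zero values to 1. The helper names `bool8_is_true` and `bool8_canonicalize` are hypothetical and are not part of Arrow, pyarrow, numpy, or libcudf; `array("b", ...)` merely stands in for an Int8 storage buffer.

```python
from array import array


def bool8_is_true(stored: int) -> bool:
    """Consumer-side rule: 0 is false, any other Int8 value is true."""
    return stored != 0


def bool8_canonicalize(storage: array) -> array:
    """Producer-side preference: return a copy with every non-zero value set to 1."""
    return array("b", (1 if value != 0 else 0 for value in storage))


# Int8 storage as it might arrive from a producer that does not canonicalize.
storage = array("b", [0, 1, 2, -1, 127])

# Decoding never needs a copy of the storage; it only inspects each value.
decoded = [bool8_is_true(value) for value in storage]

# Canonicalizing does copy, which is why it is a SHOULD rather than a MUST here.
canonical = bool8_canonicalize(storage)
```

Both `storage` and `canonical` decode to the same booleans, so a consumer following the non-zero rule sees no difference between a canonical producer and a less strict one.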



