zeroshade commented on code in PR #43234:
URL: https://github.com/apache/arrow/pull/43234#discussion_r1683049633


##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -283,6 +283,28 @@ UUID
    A specific UUID version is not required or guaranteed. This extension 
represents
    UUIDs as FixedSizeBinary(16) with big-endian notation and does not 
interpret the bytes in any way.
 
+8-bit Boolean
+====
+
+Bool8 represents a boolean value using 1 byte (8 bits) to store each value 
instead of only 1 bit as in
+the native Arrow Boolean type. Although less compact that the native 
representation, Bool8 may have
+better zero-copy compatibility with various systems that also store booleans 
using 1 byte.
+
+* Extension name: ``arrow.bool8``.
+
+* The storage type of this extension is ``Int8`` where:
+
+  * **false** is denoted by the value ``0``.
+  * **true** can be specified using any non-zero value.

Review Comment:
   I agree that it's a very nuanced topic. 
   
   > I believe this is how libcudf implements bool array comparison. This only 
works if producers of bool arrays stick to producing values in the [0, 1] range.
   
   > 
https://github.com/rapidsai/cudf/blob/34dea6fe40fc20966b48257853865111df4a687f/cpp/include/cudf/table/experimental/row_operators.cuh#L1293-L1297
   
   Doesn't this fall into the C++ semantics we described earlier? For a bool 
array comparison, this will cast each value to bool to perform the comparisons 
which will, as per C++ semantics, do precisely what you mentioned earlier to 
perform the normalization as needed. Since C++ will perform the casting in this 
case, the only reason it would fail is if the compute did a memcmp or something 
rather than performing a comparison on each element.
   
   > All we need is the producer to have the equivalent of 
reinterpret_cast<const bool*>(buffer) to convert an int8 array into a Bool8 
array without a pass that checks the values. This reinterpret_cast doesn't 
check byte by byte because the user is supposed to guarantee by construction of 
the buffer that the values on the buffer are in [0,1].
   
   I wasn't referring to the `reinterpret_cast` case where we know that the 
user has guaranteed it that the values are in [0, 1], I was referring to the 
row-based casting where each value is being cast to bool. If we state that 
producers of a `Bool8` array need to ensure that the values are in `[0, 1]`, 
then any cast implementation for int8 -> bool8 *MUST* at least perform a check 
to ensure the values are in [0, 1] if they intend on doing the equivalent of 
the `reinterpret_cast`. This is because the cast kernel itself didn't produce 
the int8 array, and thus can't know whether or not it conforms. Like you said, 
if we explicitly state that it *must* follow this, then one of the directions 
has to normalize. If we want consumers to be able to assume [0, 1] for Bool8, 
then any int8->bool8 cast will have to normalize instead of being able to do 
zero-copy (unless it does a pass to check first).
   
   > You don't have to say MUST in the spec. Just say that 1 is the value of 
choice when you're forced to pick a non-zero value to represent true.
   
   Isn't this equivalent to just saying *SHOULD*? If you agree we don't have to 
use `MUST`, then I think we're pretty much saying the same thing. The important 
distinction though, is that if we don't say that producers *MUST* use [0, 1], 
then consumers cannot and should not assume [0, 1]. Make sense?
   
   I think at this point it makes more sense for us to summarize this and put 
it onto the mailing list and see if we can get more input from other people. I 
agree that this shouldn't hold up the spec though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to