Hi Arrow devs,

I'm working on adding an extension type for 8-bit booleans, and wanted to
start a discussion about it here because it could be valuable to others if
adopted as a canonical extension type.

Arrow's native Boolean type uses 1 bit to encode each value, which gives a
very compact representation. This suits many workloads, but many systems
that produce or consume Boolean arrays use an 8-bit representation
internally and are forced to copy/convert at their boundaries. Where
zero-copy compatibility is important, an 8-bit representation of boolean
values may be preferable. This can benefit interactions with existing
libraries that avoid bit-packing column data (such as 1-bit booleans) so
that values can be processed in parallel, including GPU libraries such as
libcudf. The original issue [1] also identifies NumPy conversion as a
specific use case.
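
To make the conversion cost concrete, here is a minimal sketch in plain
Python (not the Arrow API) of what a byte-per-value consumer has to do with
a bit-packed buffer. The helper names pack_bits and unpack_bits are
hypothetical and used only for illustration; the bit order is
least-significant bit first, as in Arrow bitmaps.

```python
def pack_bits(values):
    """Pack booleans 8 per byte, least-significant bit first."""
    out = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bits(packed, length):
    """Expand a bit-packed buffer into one byte (0 or 1) per value.

    This allocates a new buffer, which is the copy/convert step that an
    8-bit consumer cannot avoid with 1-bit storage.
    """
    return bytes((packed[i // 8] >> (i % 8)) & 1 for i in range(length))


values = [True, False, True, True, False, False, False, True, True, False]
packed = pack_bits(values)                    # 10 values fit in 2 bytes
unpacked = unpack_bits(packed, len(values))   # 10 bytes, one per value
```

With an 8-bit storage type the second step disappears, at the price of
roughly 8x the memory for the values buffer.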

The details of the extension type can be found in the draft PR [2] which
contains a Go implementation (WIP) and an update to the documentation for
canonical extension types. I plan to add a C++ implementation as well but
wanted to open this discussion first.

A quick overview of the layout / semantics proposed in the PR:
Storage Type: Int8
Value Semantics: 0 is false, any non-zero value is true
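
As a rough illustration of these semantics (a sketch in plain Python, not
the Go implementation from the PR), the storage can be modeled as an array
of signed 8-bit integers. The name as_bool is hypothetical.

```python
from array import array

# Signed 8-bit storage, one value per byte (stand-in for an Int8 buffer).
storage = array("b", [0, 1, -1, 2, 0, 127])


def as_bool(v):
    """Proposed interpretation: 0 is false, any non-zero value is true."""
    return v != 0


decoded = [as_bool(v) for v in storage]

# A consumer that already uses one byte per boolean can view the same
# memory without copying or repacking.
view = memoryview(storage)
```

Note that under this reading several distinct storage values (1, -1, 2,
127) all decode to true, so comparing raw storage bytes is not the same as
comparing boolean values.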

I'd appreciate any feedback here or on the PR. If this all seems reasonable
then I'll move forward with the next implementation and open up another
proposal for a formal vote. Thanks!

[1]: https://github.com/apache/arrow/issues/17682
[2]: https://github.com/apache/arrow/pull/43234
