Hi Arrow devs,

I'm working on adding an extension type for 8-bit booleans, and wanted to start a discussion about it here because it could be valuable to others if adopted as a canonical extension type.

The native implementation of the Boolean type uses 1 bit to encode each value, enabling a very compact representation. This is favorable for many workloads, but lots of systems that want to produce/consume Boolean arrays use an 8-bit representation internally and are forced to copy/convert at their periphery. For these scenarios where zero-copy compatibility is important, the 8-bit representation of boolean values may be preferred.

This can benefit interactions with existing libraries that avoid packing column data like 1-bit booleans for parallelization purposes, including GPU libraries such as libcudf. The original issue [1] identifies numpy conversion as a specific use-case as well.

The details of the extension type can be found in the draft PR [2] which contains a Go implementation (WIP) and an update to the documentation for canonical extension types. I plan to add a C++ implementation as well but wanted to open this discussion first.

A quick overview of the layout / semantics proposed in the PR:

    Storage Type: Int8
    Value Semantics: 0 == false, any non-zero value is true

I'd appreciate any feedback here or on the PR. If this all seems reasonable then I'll move forward with the next implementation and open up another proposal for a formal vote.

Thanks!

[1]: https://github.com/apache/arrow/issues/17682
[2]: https://github.com/apache/arrow/pull/43234
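
P.S. To make the proposed semantics concrete, here is a minimal pure-Python sketch of the two representations. It is only an illustration and is not taken from the PR: the function names `unpack_bits` and `bool8_to_bools` are hypothetical, and plain `bytes` objects stand in for Arrow buffers. It assumes the LSB-first bit order that the native Boolean type uses.

```python
def unpack_bits(bitmap: bytes, length: int) -> bytes:
    """Expand a bit-packed (LSB-first) boolean buffer into one byte per value.

    This is the copy/convert step that 8-bit systems currently pay at their
    boundary when exchanging native 1-bit Boolean arrays.
    """
    return bytes((bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))


def bool8_to_bools(storage: bytes) -> list:
    """Interpret 8-bit storage under the proposed semantics.

    0 is false and any non-zero byte is true, so the buffer can be used as-is
    without repacking.
    """
    return [b != 0 for b in storage]


# Nine values packed into two bytes: T F T T F F F T | T
packed = bytes([0b10001101, 0b00000001])
unpacked = unpack_bits(packed, 9)

# Non-canonical "true" values such as 2 or 0xFF (-1 as int8) still read as true.
values = bool8_to_bools(bytes([0, 1, 2, 0xFF]))  # [False, True, True, True]
```

The first helper shows the per-value work needed to go from the packed layout to one byte per value; the second shows that reading the proposed layout requires only a comparison against zero.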