Thanks for taking the initiative on this! As demonstrated by [1], the wish for an 8-bit Boolean extension type is long-standing. I think this is a worthwhile addition to Arrow's canonical extension types.
Before the vote, I would like to see verification that this truly enables zero-copy to/from NumPy bool arrays in Python.

Ian

On Tue, Jul 16, 2024 at 7:29 AM Joel Lubinitsky <joell...@gmail.com> wrote:

> Hi Arrow devs,
>
> I'm working on adding an extension type for 8-bit booleans, and wanted to
> start a discussion about it here because it could be valuable to others if
> adopted as a canonical extension type.
>
> The native implementation of the Boolean type uses 1 bit to encode each
> value, enabling a very compact representation. This is favorable for many
> workloads, but lots of systems that want to produce/consume Boolean arrays
> use an 8-bit representation internally and are forced to copy/convert at
> their periphery. For these scenarios where zero-copy compatibility is
> important, the 8-bit representation of boolean values may be preferred.
> This can benefit interactions with existing libraries that avoid packing
> column data like 1-bit booleans for parallelization purposes, including GPU
> libraries such as libcudf. The original issue [1] identifies numpy
> conversion as a specific use-case as well.
>
> The details of the extension type can be found in the draft PR [2] which
> contains a Go implementation (WIP) and an update to the documentation for
> canonical extension types. I plan to add a C++ implementation as well but
> wanted to open this discussion first.
>
> A quick overview of the layout / semantics proposed in the PR:
> Storage Type: Int8
> Value Semantics: 0 == false, any non-zero value is true
>
> I'd appreciate any feedback here or on the PR. If this all seems reasonable
> then I'll move forward with the next implementation and open up another
> proposal for a formal vote. Thanks!
>
> [1]: https://github.com/apache/arrow/issues/17682
> [2]: https://github.com/apache/arrow/pull/43234
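
On the zero-copy question, here is a minimal NumPy-only sketch of the kind of check that could be run. It does not exercise pyarrow or the proposed extension type at all. It only assumes that the Int8 storage buffer and a NumPy bool array both use one byte per value, and the variable names are made up for illustration. It also shows that normalizing "any non-zero value is true" data with a comparison allocates a new array, so that path is not zero-copy.

```python
import numpy as np

# A NumPy bool array stores one byte per element (0 or 1).
flags = np.array([True, False, True, True], dtype=np.bool_)

# Reinterpret the same bytes as int8, matching the proposed storage type.
# view() changes only the dtype, so no data is copied.
storage = flags.view(np.int8)
assert np.shares_memory(flags, storage)

# Going back from int8 storage to bool is also a reinterpretation.
roundtrip = storage.view(np.bool_)
assert np.shares_memory(roundtrip, flags)

# Assumption: a producer may emit non-zero values other than 1, which the
# proposed semantics still treat as true. Normalizing them with a
# comparison builds a new array, so this step copies.
raw = np.array([0, 1, 2, -1], dtype=np.int8)
normalized = raw != 0
assert not np.shares_memory(normalized, raw)
```

A real verification would still need to confirm that the Arrow array's data buffer is what NumPy ends up wrapping, and decide how storage values other than 0 and 1 are handled on the NumPy side.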