>> Before the vote, I would like to see verification that this truly enables
>> zero-copy to/from NumPy bool arrays in Python.
> I think this is an implementation issue more than a specification
> issue...I am not personally worried about any provisions on the
> specification that might make this impossible.

To clarify, what I am looking for here is definite confirmation that the
proposed representation (in which a signed int8 zero value indicates
False and any non-zero signed int8 value indicates True) corresponds to
the representation used by NumPy such that bidirectional zero-copy is
made possible. This seems to me like a specification issue.

Ian

On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington <de...@voltrondata.com.invalid> wrote:

> Thank you for this! I have definitely run across the one-byte-per-item
> bool in numpy, DuckDB, and cudf. I haven't heard any discussion about
> DuckDB here but I am fairly sure that they represent their boolean
> type as an int8 as well [1].
>
> > Before the vote, I would like to see verification that this truly enables
> > zero-copy to/from NumPy bool arrays in Python.
>
> I think this is an implementation issue more than a specification
> issue...I am not personally worried about any provisions on the
> specification that might make this impossible.
>
> -dewey
>
> [1] https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37
>
> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Hi Joel,
> >
> > This looks good to me on the principle. Can you split the spec and the
> > implementation(s) into separate PRs?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On 16/07/2024 at 13:18, Joel Lubinitsky wrote:
> > > Hi Arrow devs,
> > >
> > > I'm working on adding an extension type for 8-bit booleans, and wanted to
> > > start a discussion about it here because it could be valuable to others if
> > > adopted as a canonical extension type.
> > >
> > > The native implementation of the Boolean type uses 1 bit to encode each
> > > value, enabling a very compact representation. This is favorable for many
> > > workloads, but lots of systems that want to produce/consume Boolean arrays
> > > use an 8-bit representation internally and are forced to copy/convert at
> > > their periphery. For these scenarios where zero-copy compatibility is
> > > important, the 8-bit representation of boolean values may be preferred.
> > > This can benefit interactions with existing libraries that avoid packing
> > > column data like 1-bit booleans for parallelization purposes, including GPU
> > > libraries such as libcudf. The original issue [1] identifies numpy
> > > conversion as a specific use-case as well.
> > >
> > > The details of the extension type can be found in the draft PR [2] which
> > > contains a Go implementation (WIP) and an update to the documentation for
> > > canonical extension types. I plan to add a C++ implementation as well but
> > > wanted to open this discussion first.
> > >
> > > A quick overview of the layout / semantics proposed in the PR:
> > > Storage Type: Int8
> > > Value Semantics: 0 == false, any non-zero value is true
> > >
> > > I'd appreciate any feedback here or on the PR. If this all seems reasonable
> > > then I'll move forward with the next implementation and open up another
> > > proposal for a formal vote. Thanks!
> > >
> > > [1]: https://github.com/apache/arrow/issues/17682
> > > [2]: https://github.com/apache/arrow/pull/43234
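
For readers following the thread, the proposed layout (Int8 storage, 0 == false, any non-zero value is true) can be illustrated with a small standard-library Python sketch. This is only an illustration of the semantics quoted above, not the Go or C++ implementation from the PR, and it does not by itself answer the NumPy question raised in this thread. The names `storage`, `view`, and `is_true` are hypothetical and chosen for this example only.

```python
from array import array

# Hypothetical int8 storage buffer for an 8-bit boolean column.
storage = array("b", [0, 1, -1, 2, 0, 127])

# A memoryview shares the underlying bytes with `storage`; nothing is copied.
view = memoryview(storage)


def is_true(buf, i):
    """Proposed semantics: zero is false, any non-zero int8 value is true."""
    return buf[i] != 0


decoded = [is_true(view, i) for i in range(len(view))]
print(decoded)  # [False, True, True, True, False, True]

# A write through the view is visible in the original storage,
# which is what "zero-copy" means for a consumer of the buffer.
view[0] = 5
print(storage[0])  # 5
```

The point of the sketch is that a consumer only needs a "not equal to zero" test per byte, so any producer that already stores one byte per boolean can hand over its buffer unchanged. Whether a given producer restricts itself to 0 and 1, and whether a given consumer tolerates other non-zero bytes, is the compatibility question discussed above.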