Thanks Joel and Matt. This looks good to me. I think it's worth saying here that Arrow-producing components should still by default emit Booleans in the standard bit-packed Arrow layout. This proposed bool8 canonical extension type is intended to be used in applications where the producer knows that the consumer can correctly interpret the bool8 extension type and where using it is more efficient than converting the data to the standard bit-packed layout.
Ian On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com> wrote: > Just chiming in that the libcudf documentation[1] states that this proposal > should work just fine. Bool8 type is described as "0 == false, else true". > > --Matt > > [1]: > > https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539 > > On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com> wrote: > > > Thank you for your comments. > > > > I spent some time trying to confirm definitively that this proposal would > > enable zero copy sharing both ways between pyarrow and numpy. I put > > together the following gist [1] with my experiment. > > > > To summarize the results: > > - I was able to share the underlying value buffer both ways and have it > be > > interpreted correctly in each case. > > - Numpy will write 0 or 1 to the value buffer to indicate False or True. > > Importantly, numpy will also understand values outside this range to mean > > True without requiring a copy. This tracks closely with the proposed > > semantics. > > > > [1]: https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b > > > > On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org> wrote: > > > > > >> Before the vote, I would like to see verification that this truly > > > enables > > > >> zero-copy to/from NumPy bool arrays in Python. > > > > > > > I think this is an implementation issue more than a specification > > > issue...I am not personally worried about any provisions on the > > > specification that might make this impossible. > > > > > > To clarify, what I am looking for here is definite confirmation that > > > the proposed representation (in which a signed int8 zero value > indicates > > > False and any non-zero signed int8 value indicates True) corresponds to > > the > > > representation used by NumPy such that bidirectional zero-copy is made > > > possible. This seems to me like a specification issue. > > > > > > Ian > > > > > > On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington > > > <de...@voltrondata.com.invalid> wrote: > > > > > > > Thank you for this! I have definitely run across the > one-byte-per-item > > > > bool in numpy, DuckDB, and cudf. I haven't heard any discussion about > > > > DuckDB here but I am fairly sure that they represent their boolean > > > > type as an int8 as well [1]. > > > > > > > > > Before the vote, I would like to see verification that this truly > > > enables > > > > > zero-copy to/from NumPy bool arrays in Python. > > > > > > > > I think this is an implementation issue more than a specification > > > > issue...I am not personally worried about any provisions on the > > > > specification that might make this impossible. > > > > > > > > -dewey > > > > > > > > [1] > > > > > > > > > > https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37 > > > > > > > > On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <anto...@python.org> > > > > wrote: > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > This looks good to me on the principle. Can you split the spec and > > the > > > > > implementation(s) into separate PRs? > > > > > > > > > > Regards > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit : > > > > > > Hi Arrow devs, > > > > > > > > > > > > I'm working on adding an extension type for 8-bit booleans, and > > > wanted > > > > to > > > > > > start a discussion about it here because it could be valuable to > > > > others if > > > > > > adopted as a canonical extension type. > > > > > > > > > > > > The native implementation of the Boolean type uses 1 bit to > encode > > > each > > > > > > value, enabling a very compact representation. This is favorable > > for > > > > many > > > > > > workloads, but lots of systems that want to produce/consume > Boolean > > > > arrays > > > > > > use an 8-bit representation internally and are forced to > > copy/convert > > > > at > > > > > > their periphery. For these scenarios where zero-copy > compatibility > > is > > > > > > important, the 8-bit representation of boolean values may be > > > preferred. > > > > > > This can benefit interactions with existing libraries that avoid > > > > packing > > > > > > column data like 1-bit booleans for parallelization purposes, > > > > including GPU > > > > > > libraries such as libcudf. The original issue [1] identifies > numpy > > > > > > conversion as a specific use-case as well. > > > > > > > > > > > > The details of the extension type can be found in the draft PR > [2] > > > > which > > > > > > contains a Go implementation (WIP) and an update to the > > documentation > > > > for > > > > > > canonical extension types. I plan to add a C++ implementation as > > well > > > > but > > > > > > wanted to open this discussion first. > > > > > > > > > > > > A quick overview of the layout / semantics proposed in the PR: > > > > > > Storage Type: Int8 > > > > > > Value Semantics: 0 == false, any non-zero value is true > > > > > > > > > > > > I'd appreciate any feedback here or on the PR. If this all seems > > > > reasonable > > > > > > then I'll move forward with the next implementation and open up > > > another > > > > > > proposal for a formal vote. Thanks! > > > > > > > > > > > > [1]: https://github.com/apache/arrow/issues/17682 > > > > > > [2]: https://github.com/apache/arrow/pull/43234 > > > > > > > > > > > > > > > >