Thank you Joel for working on this! I have also came across the need for a byte packed boolean support when implementing the Python dataframe interchange protocol and also DPack which is implemented in Arrow C++. The extension type is a great solution.
I will comment on the PR if I have any questions. Alenka V V sre., 17. jul. 2024 ob 23:32 je oseba Ian Cook <ianmc...@apache.org> napisala: > Thanks Joel and Matt. This looks good to me. > > I think it's worth saying here that Arrow-producing components should still > by default emit Booleans in the standard bit-packed Arrow layout. This > proposed bool8 canonical extension type is intended to be used in > applications where the producer knows that the consumer can correctly > interpret the bool8 extension type and where using it is more efficient > than converting the data to the standard bit-packed layout. > > Ian > > On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com> wrote: > > > Just chiming in that the libcudf documentation[1] states that this > proposal > > should work just fine. Bool8 type is described as "0 == false, else > true". > > > > --Matt > > > > [1]: > > > > > https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539 > > > > On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com> > wrote: > > > > > Thank you for your comments. > > > > > > I spent some time trying to confirm definitively that this proposal > would > > > enable zero copy sharing both ways between pyarrow and numpy. I put > > > together the following gist [1] with my experiment. > > > > > > To summarize the results: > > > - I was able to share the underlying value buffer both ways and have it > > be > > > interpreted correctly in each case. > > > - Numpy will write 0 or 1 to the value buffer to indicate False or > True. > > > Importantly, numpy will also understand values outside this range to > mean > > > True without requiring a copy. This tracks closely with the proposed > > > semantics. > > > > > > [1]: https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b > > > > > > On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org> wrote: > > > > > > > >> Before the vote, I would like to see verification that this truly > > > > enables > > > > >> zero-copy to/from NumPy bool arrays in Python. > > > > > > > > > I think this is an implementation issue more than a specification > > > > issue...I am not personally worried about any provisions on the > > > > specification that might make this impossible. > > > > > > > > To clarify, what I am looking for here is definite confirmation that > > > > the proposed representation (in which a signed int8 zero value > > indicates > > > > False and any non-zero signed int8 value indicates True) corresponds > to > > > the > > > > representation used by NumPy such that bidirectional zero-copy is > made > > > > possible. This seems to me like a specification issue. > > > > > > > > Ian > > > > > > > > On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington > > > > <de...@voltrondata.com.invalid> wrote: > > > > > > > > > Thank you for this! I have definitely run across the > > one-byte-per-item > > > > > bool in numpy, DuckDB, and cudf. I haven't heard any discussion > about > > > > > DuckDB here but I am fairly sure that they represent their boolean > > > > > type as an int8 as well [1]. > > > > > > > > > > > Before the vote, I would like to see verification that this truly > > > > enables > > > > > > zero-copy to/from NumPy bool arrays in Python. > > > > > > > > > > I think this is an implementation issue more than a specification > > > > > issue...I am not personally worried about any provisions on the > > > > > specification that might make this impossible. > > > > > > > > > > -dewey > > > > > > > > > > [1] > > > > > > > > > > > > > > > https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37 > > > > > > > > > > On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou < > anto...@python.org> > > > > > wrote: > > > > > > > > > > > > > > > > > > Hi Joel, > > > > > > > > > > > > This looks good to me on the principle. Can you split the spec > and > > > the > > > > > > implementation(s) into separate PRs? > > > > > > > > > > > > Regards > > > > > > > > > > > > Antoine. > > > > > > > > > > > > > > > > > > Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit : > > > > > > > Hi Arrow devs, > > > > > > > > > > > > > > I'm working on adding an extension type for 8-bit booleans, and > > > > wanted > > > > > to > > > > > > > start a discussion about it here because it could be valuable > to > > > > > others if > > > > > > > adopted as a canonical extension type. > > > > > > > > > > > > > > The native implementation of the Boolean type uses 1 bit to > > encode > > > > each > > > > > > > value, enabling a very compact representation. This is > favorable > > > for > > > > > many > > > > > > > workloads, but lots of systems that want to produce/consume > > Boolean > > > > > arrays > > > > > > > use an 8-bit representation internally and are forced to > > > copy/convert > > > > > at > > > > > > > their periphery. For these scenarios where zero-copy > > compatibility > > > is > > > > > > > important, the 8-bit representation of boolean values may be > > > > preferred. > > > > > > > This can benefit interactions with existing libraries that > avoid > > > > > packing > > > > > > > column data like 1-bit booleans for parallelization purposes, > > > > > including GPU > > > > > > > libraries such as libcudf. The original issue [1] identifies > > numpy > > > > > > > conversion as a specific use-case as well. > > > > > > > > > > > > > > The details of the extension type can be found in the draft PR > > [2] > > > > > which > > > > > > > contains a Go implementation (WIP) and an update to the > > > documentation > > > > > for > > > > > > > canonical extension types. I plan to add a C++ implementation > as > > > well > > > > > but > > > > > > > wanted to open this discussion first. > > > > > > > > > > > > > > A quick overview of the layout / semantics proposed in the PR: > > > > > > > Storage Type: Int8 > > > > > > > Value Semantics: 0 == false, any non-zero value is true > > > > > > > > > > > > > > I'd appreciate any feedback here or on the PR. If this all > seems > > > > > reasonable > > > > > > > then I'll move forward with the next implementation and open up > > > > another > > > > > > > proposal for a formal vote. Thanks! > > > > > > > > > > > > > > [1]: https://github.com/apache/arrow/issues/17682 > > > > > > > [2]: https://github.com/apache/arrow/pull/43234 > > > > > > > > > > > > > > > > > > > > > >