The extension-ness of it is a valid point...all the other cases where we have multiple Arrow types for the same element type (e.g., String, LargeString, StringView) are first-class types. For a Bool8, the tradeoffs are roughly the same (less support for StringView and LargeString, more space required for LargeString, etc.).
For me the choice of whether or not to have this be a first-class type or an extension type is just because there is no change required in Schema.fbs/existing implementations can pass through instances of the type without modification (as long as they support extension types). I believe there was some consensus on a previous thread that I can't find now that new types should be implemented as extension types if possible for these (and perhaps other) reasons. On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote: > > > Agreed with Felipe. This is meant for communicating with non-Arrow type > systems, but shouldn't be regarded as an alternative first-class boolean > type. > > Regards > > Antoine. > > > Le 19/07/2024 à 06:30, Felipe Oliveira Carvalho a écrit : > > I think it would confuse implementors of the spec and people implementing > > kernels way too much. “the bool Arrow type” should probably not start > > meaning two different things. > > > > — > > Felipe > > > > On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <emkornfi...@gmail.com> wrote: > > > >> As Boolean is already in the arrow type system I think it might be worth > >> asking the question as to whether this should be an extension type or a > >> first class type. > >> > >> Given what I think of the last discussion on the trade-offs [1], I think > >> there is room for debate here, since Boolean is not currently > >> parameterized, adding it as an existing type would require a new top level > >> type. > >> > >> Thanks, > >> Micah > >> > >> [1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t > >> > >> On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <frim.ale...@gmail.com> wrote: > >> > >>> Thank you Joel for working on this! I have also came across > >>> the need for a byte packed boolean support when implementing the > >>> Python dataframe interchange protocol and also DPack which > >>> is implemented in Arrow C++. The extension type is a great solution. > >>> > >>> I will comment on the PR if I have any questions. > >>> > >>> Alenka > >>> > >>> V V sre., 17. jul. 2024 ob 23:32 je oseba Ian Cook <ianmc...@apache.org> > >>> napisala: > >>> > >>>> Thanks Joel and Matt. This looks good to me. > >>>> > >>>> I think it's worth saying here that Arrow-producing components should > >>> still > >>>> by default emit Booleans in the standard bit-packed Arrow layout. This > >>>> proposed bool8 canonical extension type is intended to be used in > >>>> applications where the producer knows that the consumer can correctly > >>>> interpret the bool8 extension type and where using it is more efficient > >>>> than converting the data to the standard bit-packed layout. > >>>> > >>>> Ian > >>>> > >>>> On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com> > >>> wrote: > >>>> > >>>>> Just chiming in that the libcudf documentation[1] states that this > >>>> proposal > >>>>> should work just fine. Bool8 type is described as "0 == false, else > >>>> true". > >>>>> > >>>>> --Matt > >>>>> > >>>>> [1]: > >>>>> > >>>>> > >>>> > >>> > >> https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539 > >>>>> > >>>>> On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com> > >>>> wrote: > >>>>> > >>>>>> Thank you for your comments. > >>>>>> > >>>>>> I spent some time trying to confirm definitively that this proposal > >>>> would > >>>>>> enable zero copy sharing both ways between pyarrow and numpy. I put > >>>>>> together the following gist [1] with my experiment. > >>>>>> > >>>>>> To summarize the results: > >>>>>> - I was able to share the underlying value buffer both ways and > >> have > >>> it > >>>>> be > >>>>>> interpreted correctly in each case. > >>>>>> - Numpy will write 0 or 1 to the value buffer to indicate False or > >>>> True. > >>>>>> Importantly, numpy will also understand values outside this range > >> to > >>>> mean > >>>>>> True without requiring a copy. This tracks closely with the > >> proposed > >>>>>> semantics. > >>>>>> > >>>>>> [1]: > >>> https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b > >>>>>> > >>>>>> On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org> > >>> wrote: > >>>>>> > >>>>>>>>> Before the vote, I would like to see verification that this > >>> truly > >>>>>>> enables > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python. > >>>>>>> > >>>>>>>> I think this is an implementation issue more than a > >> specification > >>>>>>> issue...I am not personally worried about any provisions on the > >>>>>>> specification that might make this impossible. > >>>>>>> > >>>>>>> To clarify, what I am looking for here is definite confirmation > >>> that > >>>>>>> the proposed representation (in which a signed int8 zero value > >>>>> indicates > >>>>>>> False and any non-zero signed int8 value indicates True) > >>> corresponds > >>>> to > >>>>>> the > >>>>>>> representation used by NumPy such that bidirectional zero-copy is > >>>> made > >>>>>>> possible. This seems to me like a specification issue. > >>>>>>> > >>>>>>> Ian > >>>>>>> > >>>>>>> On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington > >>>>>>> <de...@voltrondata.com.invalid> wrote: > >>>>>>> > >>>>>>>> Thank you for this! I have definitely run across the > >>>>> one-byte-per-item > >>>>>>>> bool in numpy, DuckDB, and cudf. I haven't heard any discussion > >>>> about > >>>>>>>> DuckDB here but I am fairly sure that they represent their > >>> boolean > >>>>>>>> type as an int8 as well [1]. > >>>>>>>> > >>>>>>>>> Before the vote, I would like to see verification that this > >>> truly > >>>>>>> enables > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python. > >>>>>>>> > >>>>>>>> I think this is an implementation issue more than a > >> specification > >>>>>>>> issue...I am not personally worried about any provisions on the > >>>>>>>> specification that might make this impossible. > >>>>>>>> > >>>>>>>> -dewey > >>>>>>>> > >>>>>>>> [1] > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37 > >>>>>>>> > >>>>>>>> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou < > >>>> anto...@python.org> > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Hi Joel, > >>>>>>>>> > >>>>>>>>> This looks good to me on the principle. Can you split the > >> spec > >>>> and > >>>>>> the > >>>>>>>>> implementation(s) into separate PRs? > >>>>>>>>> > >>>>>>>>> Regards > >>>>>>>>> > >>>>>>>>> Antoine. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit : > >>>>>>>>>> Hi Arrow devs, > >>>>>>>>>> > >>>>>>>>>> I'm working on adding an extension type for 8-bit booleans, > >>> and > >>>>>>> wanted > >>>>>>>> to > >>>>>>>>>> start a discussion about it here because it could be > >> valuable > >>>> to > >>>>>>>> others if > >>>>>>>>>> adopted as a canonical extension type. > >>>>>>>>>> > >>>>>>>>>> The native implementation of the Boolean type uses 1 bit to > >>>>> encode > >>>>>>> each > >>>>>>>>>> value, enabling a very compact representation. This is > >>>> favorable > >>>>>> for > >>>>>>>> many > >>>>>>>>>> workloads, but lots of systems that want to produce/consume > >>>>> Boolean > >>>>>>>> arrays > >>>>>>>>>> use an 8-bit representation internally and are forced to > >>>>>> copy/convert > >>>>>>>> at > >>>>>>>>>> their periphery. For these scenarios where zero-copy > >>>>> compatibility > >>>>>> is > >>>>>>>>>> important, the 8-bit representation of boolean values may > >> be > >>>>>>> preferred. > >>>>>>>>>> This can benefit interactions with existing libraries that > >>>> avoid > >>>>>>>> packing > >>>>>>>>>> column data like 1-bit booleans for parallelization > >> purposes, > >>>>>>>> including GPU > >>>>>>>>>> libraries such as libcudf. The original issue [1] > >> identifies > >>>>> numpy > >>>>>>>>>> conversion as a specific use-case as well. > >>>>>>>>>> > >>>>>>>>>> The details of the extension type can be found in the draft > >>> PR > >>>>> [2] > >>>>>>>> which > >>>>>>>>>> contains a Go implementation (WIP) and an update to the > >>>>>> documentation > >>>>>>>> for > >>>>>>>>>> canonical extension types. I plan to add a C++ > >> implementation > >>>> as > >>>>>> well > >>>>>>>> but > >>>>>>>>>> wanted to open this discussion first. > >>>>>>>>>> > >>>>>>>>>> A quick overview of the layout / semantics proposed in the > >>> PR: > >>>>>>>>>> Storage Type: Int8 > >>>>>>>>>> Value Semantics: 0 == false, any non-zero value is true > >>>>>>>>>> > >>>>>>>>>> I'd appreciate any feedback here or on the PR. If this all > >>>> seems > >>>>>>>> reasonable > >>>>>>>>>> then I'll move forward with the next implementation and > >> open > >>> up > >>>>>>> another > >>>>>>>>>> proposal for a formal vote. Thanks! > >>>>>>>>>> > >>>>>>>>>> [1]: https://github.com/apache/arrow/issues/17682 > >>>>>>>>>> [2]: https://github.com/apache/arrow/pull/43234 > >>>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > >