I'm fine in principle with this being an extension type I just want to make sure we had this conversation. Some replies inline.
I think it would confuse implementors of the spec and people implementing > kernels way too much. “the bool Arrow type” should probably not start > meaning two different things. This is meant for communicating with non-Arrow type > systems, but shouldn't be regarded as an alternative first-class boolean > type. As a counterpoint we have two different units for date [1] that don't really convey meaningful new information. Implementations have to deal with this somehow, and I think the only reason this exists is effectively to support different type systems. I > believe there was some consensus on a previous thread that I can't > find now that new types should be implemented as extension types if > possible for these (and perhaps other) reasons. I believe this was for net new types, for all the reasons mentioned canonical extension types are preferred. Again, the last thread I'm aware of is [2] which for all the reasons mentioned prefers canonical extension type. The only reason why I brought this up is because there is already a notion of Bool, and thus there is some consistency to having new bool representations directly in the specification. Thanks, Micah [1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L251 [2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t On Fri, Jul 19, 2024 at 6:04 AM Dewey Dunnington <de...@voltrondata.com.invalid> wrote: > The extension-ness of it is a valid point...all the other cases where > we have multiple Arrow types for the same element type (e.g., String, > LargeString, StringView) are first-class types. For a Bool8, the > tradeoffs are roughly the same (less support for StringView and > LargeString, more space required for LargeString, etc.). > > For me the choice of whether or not to have this be a first-class type > or an extension type is just because there is no change required in > Schema.fbs/existing implementations can pass through instances of the > type without modification (as long as they support extension types). I > believe there was some consensus on a previous thread that I can't > find now that new types should be implemented as extension types if > possible for these (and perhaps other) reasons. > > > On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote: > > > > > > Agreed with Felipe. This is meant for communicating with non-Arrow type > > systems, but shouldn't be regarded as an alternative first-class boolean > > type. > > > > Regards > > > > Antoine. > > > > > > Le 19/07/2024 à 06:30, Felipe Oliveira Carvalho a écrit : > > > I think it would confuse implementors of the spec and people > implementing > > > kernels way too much. “the bool Arrow type” should probably not start > > > meaning two different things. > > > > > > — > > > Felipe > > > > > > On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > > > >> As Boolean is already in the arrow type system I think it might be > worth > > >> asking the question as to whether this should be an extension type or > a > > >> first class type. > > >> > > >> Given what I think of the last discussion on the trade-offs [1], I > think > > >> there is room for debate here, since Boolean is not currently > > >> parameterized, adding it as an existing type would require a new top > level > > >> type. > > >> > > >> Thanks, > > >> Micah > > >> > > >> [1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t > > >> > > >> On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <frim.ale...@gmail.com> > wrote: > > >> > > >>> Thank you Joel for working on this! I have also came across > > >>> the need for a byte packed boolean support when implementing the > > >>> Python dataframe interchange protocol and also DPack which > > >>> is implemented in Arrow C++. The extension type is a great solution. > > >>> > > >>> I will comment on the PR if I have any questions. > > >>> > > >>> Alenka > > >>> > > >>> V V sre., 17. jul. 2024 ob 23:32 je oseba Ian Cook < > ianmc...@apache.org> > > >>> napisala: > > >>> > > >>>> Thanks Joel and Matt. This looks good to me. > > >>>> > > >>>> I think it's worth saying here that Arrow-producing components > should > > >>> still > > >>>> by default emit Booleans in the standard bit-packed Arrow layout. > This > > >>>> proposed bool8 canonical extension type is intended to be used in > > >>>> applications where the producer knows that the consumer can > correctly > > >>>> interpret the bool8 extension type and where using it is more > efficient > > >>>> than converting the data to the standard bit-packed layout. > > >>>> > > >>>> Ian > > >>>> > > >>>> On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com> > > >>> wrote: > > >>>> > > >>>>> Just chiming in that the libcudf documentation[1] states that this > > >>>> proposal > > >>>>> should work just fine. Bool8 type is described as "0 == false, else > > >>>> true". > > >>>>> > > >>>>> --Matt > > >>>>> > > >>>>> [1]: > > >>>>> > > >>>>> > > >>>> > > >>> > > >> > https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539 > > >>>>> > > >>>>> On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com> > > >>>> wrote: > > >>>>> > > >>>>>> Thank you for your comments. > > >>>>>> > > >>>>>> I spent some time trying to confirm definitively that this > proposal > > >>>> would > > >>>>>> enable zero copy sharing both ways between pyarrow and numpy. I > put > > >>>>>> together the following gist [1] with my experiment. > > >>>>>> > > >>>>>> To summarize the results: > > >>>>>> - I was able to share the underlying value buffer both ways and > > >> have > > >>> it > > >>>>> be > > >>>>>> interpreted correctly in each case. > > >>>>>> - Numpy will write 0 or 1 to the value buffer to indicate False or > > >>>> True. > > >>>>>> Importantly, numpy will also understand values outside this range > > >> to > > >>>> mean > > >>>>>> True without requiring a copy. This tracks closely with the > > >> proposed > > >>>>>> semantics. > > >>>>>> > > >>>>>> [1]: > > >>> https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b > > >>>>>> > > >>>>>> On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org> > > >>> wrote: > > >>>>>> > > >>>>>>>>> Before the vote, I would like to see verification that this > > >>> truly > > >>>>>>> enables > > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python. > > >>>>>>> > > >>>>>>>> I think this is an implementation issue more than a > > >> specification > > >>>>>>> issue...I am not personally worried about any provisions on the > > >>>>>>> specification that might make this impossible. > > >>>>>>> > > >>>>>>> To clarify, what I am looking for here is definite confirmation > > >>> that > > >>>>>>> the proposed representation (in which a signed int8 zero value > > >>>>> indicates > > >>>>>>> False and any non-zero signed int8 value indicates True) > > >>> corresponds > > >>>> to > > >>>>>> the > > >>>>>>> representation used by NumPy such that bidirectional zero-copy is > > >>>> made > > >>>>>>> possible. This seems to me like a specification issue. > > >>>>>>> > > >>>>>>> Ian > > >>>>>>> > > >>>>>>> On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington > > >>>>>>> <de...@voltrondata.com.invalid> wrote: > > >>>>>>> > > >>>>>>>> Thank you for this! I have definitely run across the > > >>>>> one-byte-per-item > > >>>>>>>> bool in numpy, DuckDB, and cudf. I haven't heard any discussion > > >>>> about > > >>>>>>>> DuckDB here but I am fairly sure that they represent their > > >>> boolean > > >>>>>>>> type as an int8 as well [1]. > > >>>>>>>> > > >>>>>>>>> Before the vote, I would like to see verification that this > > >>> truly > > >>>>>>> enables > > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python. > > >>>>>>>> > > >>>>>>>> I think this is an implementation issue more than a > > >> specification > > >>>>>>>> issue...I am not personally worried about any provisions on the > > >>>>>>>> specification that might make this impossible. > > >>>>>>>> > > >>>>>>>> -dewey > > >>>>>>>> > > >>>>>>>> [1] > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37 > > >>>>>>>> > > >>>>>>>> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou < > > >>>> anto...@python.org> > > >>>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Hi Joel, > > >>>>>>>>> > > >>>>>>>>> This looks good to me on the principle. Can you split the > > >> spec > > >>>> and > > >>>>>> the > > >>>>>>>>> implementation(s) into separate PRs? > > >>>>>>>>> > > >>>>>>>>> Regards > > >>>>>>>>> > > >>>>>>>>> Antoine. > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit : > > >>>>>>>>>> Hi Arrow devs, > > >>>>>>>>>> > > >>>>>>>>>> I'm working on adding an extension type for 8-bit booleans, > > >>> and > > >>>>>>> wanted > > >>>>>>>> to > > >>>>>>>>>> start a discussion about it here because it could be > > >> valuable > > >>>> to > > >>>>>>>> others if > > >>>>>>>>>> adopted as a canonical extension type. > > >>>>>>>>>> > > >>>>>>>>>> The native implementation of the Boolean type uses 1 bit to > > >>>>> encode > > >>>>>>> each > > >>>>>>>>>> value, enabling a very compact representation. This is > > >>>> favorable > > >>>>>> for > > >>>>>>>> many > > >>>>>>>>>> workloads, but lots of systems that want to produce/consume > > >>>>> Boolean > > >>>>>>>> arrays > > >>>>>>>>>> use an 8-bit representation internally and are forced to > > >>>>>> copy/convert > > >>>>>>>> at > > >>>>>>>>>> their periphery. For these scenarios where zero-copy > > >>>>> compatibility > > >>>>>> is > > >>>>>>>>>> important, the 8-bit representation of boolean values may > > >> be > > >>>>>>> preferred. > > >>>>>>>>>> This can benefit interactions with existing libraries that > > >>>> avoid > > >>>>>>>> packing > > >>>>>>>>>> column data like 1-bit booleans for parallelization > > >> purposes, > > >>>>>>>> including GPU > > >>>>>>>>>> libraries such as libcudf. The original issue [1] > > >> identifies > > >>>>> numpy > > >>>>>>>>>> conversion as a specific use-case as well. > > >>>>>>>>>> > > >>>>>>>>>> The details of the extension type can be found in the draft > > >>> PR > > >>>>> [2] > > >>>>>>>> which > > >>>>>>>>>> contains a Go implementation (WIP) and an update to the > > >>>>>> documentation > > >>>>>>>> for > > >>>>>>>>>> canonical extension types. I plan to add a C++ > > >> implementation > > >>>> as > > >>>>>> well > > >>>>>>>> but > > >>>>>>>>>> wanted to open this discussion first. > > >>>>>>>>>> > > >>>>>>>>>> A quick overview of the layout / semantics proposed in the > > >>> PR: > > >>>>>>>>>> Storage Type: Int8 > > >>>>>>>>>> Value Semantics: 0 == false, any non-zero value is true > > >>>>>>>>>> > > >>>>>>>>>> I'd appreciate any feedback here or on the PR. If this all > > >>>> seems > > >>>>>>>> reasonable > > >>>>>>>>>> then I'll move forward with the next implementation and > > >> open > > >>> up > > >>>>>>> another > > >>>>>>>>>> proposal for a formal vote. Thanks! > > >>>>>>>>>> > > >>>>>>>>>> [1]: https://github.com/apache/arrow/issues/17682 > > >>>>>>>>>> [2]: https://github.com/apache/arrow/pull/43234 > > >>>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > >>> > > >> > > > >