Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

Dewey Dunnington Fri, 19 Jul 2024 06:04:26 -0700

The extension-ness of it is a valid point...all the other cases where
we have multiple Arrow types for the same element type (e.g., String,
LargeString, StringView) are first-class types. For a Bool8, the
tradeoffs are roughly the same (less support for StringView and
LargeString, more space required for LargeString, etc.).
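
To make the space tradeoff concrete, here is a minimal sketch in plain Python (standard library only, not pyarrow) that contrasts the one-byte-per-value storage with a bit-packed bitmap. It assumes the semantics quoted further down this thread (Int8 storage, 0 == false, any non-zero value is true) and least-significant-bit-first packing for the bitmap. The helper names `bool8_to_bools` and `pack_bits` are made up for illustration and are not part of any Arrow API.

```python
import array


def bool8_to_bools(storage):
    """Interpret int8 storage: 0 is false, any non-zero value is true."""
    return [value != 0 for value in storage]


def pack_bits(bools):
    """Pack booleans into a bitmap, least-significant bit first."""
    out = bytearray((len(bools) + 7) // 8)
    for i, flag in enumerate(bools):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# Ten values stored one signed byte each; -1 and 7 both count as true.
storage = array.array("b", [0, 1, -1, 0, 7, 0, 0, 1, 1, 0])

# A memoryview shares the array's buffer, so reading it copies nothing.
view = memoryview(storage)

bools = bool8_to_bools(view)
packed = pack_bits(bools)  # producing the bit-packed form requires a conversion

print(view.nbytes, len(packed))
```

The byte-per-value form takes 10 bytes here against 2 for the bitmap, but it can be handed over as-is, while the bitmap has to be built.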


For me, the choice between a first-class type and an extension type comes
down to this: an extension type requires no change to Schema.fbs, and
existing implementations can pass through instances of the type without
modification (as long as they support extension types). I believe there
was some consensus on a previous thread, which I can't find now, that new
types should be implemented as extension types where possible for these
(and perhaps other) reasons.
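
As a rough illustration of the pass-through argument, the sketch below models a schema field as a plain Python dict rather than using any Arrow library. The metadata keys `ARROW:extension:name` and `ARROW:extension:metadata` are the ones the Arrow format uses for extension types; the extension name `arrow.bool8` and the empty metadata string are assumptions here, since this thread does not state them. The functions `make_bool8_field`, `passthrough`, and `logical_type` are hypothetical.

```python
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"
BOOL8_NAME = "arrow.bool8"  # assumed name, not stated in this thread


def make_bool8_field(name):
    """Describe a bool8 column: int8 storage plus extension metadata."""
    return {
        "name": name,
        "type": "int8",
        "metadata": {EXT_NAME_KEY: BOOL8_NAME, EXT_META_KEY: ""},
    }


def passthrough(field):
    """An implementation that knows nothing about bool8.

    It sees an ordinary int8 field and copies it, metadata included.
    """
    return {
        "name": field["name"],
        "type": field["type"],
        "metadata": dict(field["metadata"]),
    }


def logical_type(field):
    """A consumer that recognizes the extension name when it is present."""
    if field["metadata"].get(EXT_NAME_KEY) == BOOL8_NAME:
        return "bool8"
    return field["type"]


field = make_bool8_field("flag")
relayed = passthrough(field)
print(relayed["type"], logical_type(relayed))
```

The intermediate step only ever handles an int8 field, yet the consumer at the end can still recover the boolean meaning from the metadata.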


On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Agreed with Felipe. This is meant for communicating with non-Arrow type
> systems, but shouldn't be regarded as an alternative first-class boolean
> type.
>
> Regards
>
> Antoine.
>
>
> On 19/07/2024 at 06:30, Felipe Oliveira Carvalho wrote:
> > I think it would confuse implementors of the spec and people implementing
> > kernels way too much. “the bool Arrow type” should probably not start
> > meaning two different things.
> >
> > —
> > Felipe
> >
> > On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> >> As Boolean is already in the arrow type system I think it might be worth
> >> asking the question as to whether this should be an extension type or a
> >> first class type.
> >>
> >> Given what I think of the last discussion on the trade-offs [1], I think
> >> there is room for debate here, since Boolean is not currently
> >> parameterized, adding it as an existing type would require a new top level
> >> type.
> >>
> >> Thanks,
> >> Micah
> >>
> >> [1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t
> >>
> >> On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <frim.ale...@gmail.com> wrote:
> >>
> >>> Thank you Joel for working on this! I have also come across
> >>> the need for byte-packed boolean support when implementing the
> >>> Python dataframe interchange protocol and also DLPack, which
> >>> is implemented in Arrow C++. The extension type is a great solution.
> >>>
> >>> I will comment on the PR if I have any questions.
> >>>
> >>> Alenka
> >>>
> >>> On Wed, Jul 17, 2024 at 11:32 PM Ian Cook <ianmc...@apache.org> wrote:
> >>>
> >>>> Thanks Joel and Matt. This looks good to me.
> >>>>
> >>>> I think it's worth saying here that Arrow-producing components should
> >>>> still by default emit Booleans in the standard bit-packed Arrow layout.
> >>>> This proposed bool8 canonical extension type is intended to be used in
> >>>> applications where the producer knows that the consumer can correctly
> >>>> interpret the bool8 extension type and where using it is more efficient
> >>>> than converting the data to the standard bit-packed layout.
> >>>>
> >>>> Ian
> >>>>
> >>>> On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> >>>>
> >>>>> Just chiming in that the libcudf documentation[1] states that this
> >>>>> proposal should work just fine. Bool8 type is described as "0 == false,
> >>>>> else true".
> >>>>>
> >>>>> --Matt
> >>>>>
> >>>>> [1]: https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539
> >>>>>
> >>>>> On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com> wrote:
> >>>>>
> >>>>>> Thank you for your comments.
> >>>>>>
> >>>>>> I spent some time trying to confirm definitively that this proposal
> >>>>>> would enable zero copy sharing both ways between pyarrow and numpy. I
> >>>>>> put together the following gist [1] with my experiment.
> >>>>>>
> >>>>>> To summarize the results:
> >>>>>> - I was able to share the underlying value buffer both ways and have
> >>>>>>   it be interpreted correctly in each case.
> >>>>>> - Numpy will write 0 or 1 to the value buffer to indicate False or
> >>>>>>   True. Importantly, numpy will also understand values outside this
> >>>>>>   range to mean True without requiring a copy. This tracks closely
> >>>>>>   with the proposed semantics.
> >>>>>>
> >>>>>> [1]: https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b
> >>>>>>
> >>>>>> On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org> wrote:
> >>>>>>
> >>>>>>>>> Before the vote, I would like to see verification that this truly
> >>>>>>>>> enables zero-copy to/from NumPy bool arrays in Python.
> >>>>>>>
> >>>>>>>> I think this is an implementation issue more than a specification
> >>>>>>>> issue...I am not personally worried about any provisions on the
> >>>>>>>> specification that might make this impossible.
> >>>>>>>
> >>>>>>> To clarify, what I am looking for here is definite confirmation that
> >>>>>>> the proposed representation (in which a signed int8 zero value
> >>>>>>> indicates False and any non-zero signed int8 value indicates True)
> >>>>>>> corresponds to the representation used by NumPy such that
> >>>>>>> bidirectional zero-copy is made possible. This seems to me like a
> >>>>>>> specification issue.
> >>>>>>>
> >>>>>>> Ian
> >>>>>>>
> >>>>>>> On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington
> >>>>>>> <de...@voltrondata.com.invalid> wrote:
> >>>>>>>
> >>>>>>>> Thank you for this! I have definitely run across the
> >>>>>>>> one-byte-per-item bool in numpy, DuckDB, and cudf. I haven't heard
> >>>>>>>> any discussion about DuckDB here but I am fairly sure that they
> >>>>>>>> represent their boolean type as an int8 as well [1].
> >>>>>>>>
> >>>>>>>>> Before the vote, I would like to see verification that this truly
> >>>>>>>>> enables zero-copy to/from NumPy bool arrays in Python.
> >>>>>>>>
> >>>>>>>> I think this is an implementation issue more than a specification
> >>>>>>>> issue...I am not personally worried about any provisions on the
> >>>>>>>> specification that might make this impossible.
> >>>>>>>>
> >>>>>>>> -dewey
> >>>>>>>>
> >>>>>>>> [1] https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37
> >>>>>>>>
> >>>>>>>> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <anto...@python.org> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hi Joel,
> >>>>>>>>>
> >>>>>>>>> This looks good to me in principle. Can you split the spec and the
> >>>>>>>>> implementation(s) into separate PRs?
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>>
> >>>>>>>>> Antoine.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 16/07/2024 at 13:18, Joel Lubinitsky wrote:
> >>>>>>>>>> Hi Arrow devs,
> >>>>>>>>>>
> >>>>>>>>>> I'm working on adding an extension type for 8-bit booleans, and
> >>>>>>>>>> wanted to start a discussion about it here because it could be
> >>>>>>>>>> valuable to others if adopted as a canonical extension type.
> >>>>>>>>>>
> >>>>>>>>>> The native implementation of the Boolean type uses 1 bit to encode
> >>>>>>>>>> each value, enabling a very compact representation. This is
> >>>>>>>>>> favorable for many workloads, but lots of systems that want to
> >>>>>>>>>> produce/consume Boolean arrays use an 8-bit representation
> >>>>>>>>>> internally and are forced to copy/convert at their periphery. For
> >>>>>>>>>> these scenarios where zero-copy compatibility is important, the
> >>>>>>>>>> 8-bit representation of boolean values may be preferred. This can
> >>>>>>>>>> benefit interactions with existing libraries that avoid packing
> >>>>>>>>>> column data like 1-bit booleans for parallelization purposes,
> >>>>>>>>>> including GPU libraries such as libcudf. The original issue [1]
> >>>>>>>>>> identifies numpy conversion as a specific use-case as well.
> >>>>>>>>>>
> >>>>>>>>>> The details of the extension type can be found in the draft PR [2]
> >>>>>>>>>> which contains a Go implementation (WIP) and an update to the
> >>>>>>>>>> documentation for canonical extension types. I plan to add a C++
> >>>>>>>>>> implementation as well but wanted to open this discussion first.
> >>>>>>>>>>
> >>>>>>>>>> A quick overview of the layout / semantics proposed in the PR:
> >>>>>>>>>> Storage Type: Int8
> >>>>>>>>>> Value Semantics: 0 == false, any non-zero value is true
> >>>>>>>>>>
> >>>>>>>>>> I'd appreciate any feedback here or on the PR. If this all seems
> >>>>>>>>>> reasonable then I'll move forward with the next implementation and
> >>>>>>>>>> open up another proposal for a formal vote. Thanks!
> >>>>>>>>>>
> >>>>>>>>>> [1]: https://github.com/apache/arrow/issues/17682
> >>>>>>>>>> [2]: https://github.com/apache/arrow/pull/43234
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
