Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

Micah Kornfield Fri, 19 Jul 2024 09:25:34 -0700

I'm fine in principle with this being an extension type I just want to make
sure we had this conversation.  Some replies inline.


I think it would confuse implementors of the spec and people implementing
> kernels way too much. “the bool Arrow type” should probably not start
> meaning two different things.


This is meant for communicating with non-Arrow type
> systems, but shouldn't be regarded as an alternative first-class boolean
> type.


As a counterpoint we have two different units for date [1] that don't
really convey meaningful new information.  Implementations have to deal
with this somehow, and I think the only reason this exists is effectively
to support different type systems.

I
> believe there was some consensus on a previous thread that I can't
> find now that new types should be implemented as extension types if
> possible for these (and perhaps other) reasons.


I believe this was for net new types, for all the reasons mentioned
canonical extension types are preferred.  Again, the last thread I'm aware
of is [2] which for all the reasons mentioned prefers canonical extension
type.  The only reason why I brought this up is because there is already a
notion of Bool, and thus there is some consistency to having new bool
representations directly in the specification.

Thanks,
Micah




[1] https://github.com/apache/arrow/blob/main/format/Schema.fbs#L251
[2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t

On Fri, Jul 19, 2024 at 6:04 AM Dewey Dunnington
<[email protected]> wrote:

> The extension-ness of it is a valid point...all the other cases where
> we have multiple Arrow types for the same element type (e.g., String,
> LargeString, StringView) are first-class types. For a Bool8, the
> tradeoffs are roughly the same (less support for StringView and
> LargeString, more space required for LargeString, etc.).
>
> For me the choice of whether or not to have this be a first-class type
> or an extension type is just because there is no change required in
> Schema.fbs/existing implementations can pass through instances of the
> type without modification (as long as they support extension types). I
> believe there was some consensus on a previous thread that I can't
> find now that new types should be implemented as extension types if
> possible for these (and perhaps other) reasons.
>
>
> On Fri, Jul 19, 2024 at 5:39 AM Antoine Pitrou <[email protected]> wrote:
> >
> >
> > Agreed with Felipe. This is meant for communicating with non-Arrow type
> > systems, but shouldn't be regarded as an alternative first-class boolean
> > type.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 19/07/2024 à 06:30, Felipe Oliveira Carvalho a écrit :
> > > I think it would confuse implementors of the spec and people
> implementing
> > > kernels way too much. “the bool Arrow type” should probably not start
> > > meaning two different things.
> > >
> > > —
> > > Felipe
> > >
> > > On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <[email protected]>
> wrote:
> > >
> > >> As Boolean is already in the arrow type system I think it might be
> worth
> > >> asking the question as to whether this should be an extension type or
> a
> > >> first class type.
> > >>
> > >> Given what I think of the  last discussion on the trade-offs [1], I
> think
> > >> there is room for debate here, since Boolean is not currently
> > >> parameterized, adding it as an existing type would require a new top
> level
> > >> type.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t
> > >>
> > >> On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <[email protected]>
> wrote:
> > >>
> > >>> Thank you Joel for working on this! I have also came across
> > >>> the need for a byte packed boolean support when implementing the
> > >>> Python dataframe interchange protocol and also DPack which
> > >>> is implemented in Arrow C++. The extension type is a great solution.
> > >>>
> > >>> I will comment on the PR if I have any questions.
> > >>>
> > >>> Alenka
> > >>>
> > >>> V V sre., 17. jul. 2024 ob 23:32 je oseba Ian Cook <
> [email protected]>
> > >>> napisala:
> > >>>
> > >>>> Thanks Joel and Matt. This looks good to me.
> > >>>>
> > >>>> I think it's worth saying here that Arrow-producing components
> should
> > >>> still
> > >>>> by default emit Booleans in the standard bit-packed Arrow layout.
> This
> > >>>> proposed bool8 canonical extension type is intended to be used in
> > >>>> applications where the producer knows that the consumer can
> correctly
> > >>>> interpret the bool8 extension type and where using it is more
> efficient
> > >>>> than converting the data to the standard bit-packed layout.
> > >>>>
> > >>>> Ian
> > >>>>
> > >>>> On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <[email protected]>
> > >>> wrote:
> > >>>>
> > >>>>> Just chiming in that the libcudf documentation[1] states that this
> > >>>> proposal
> > >>>>> should work just fine. Bool8 type is described as "0 == false, else
> > >>>> true".
> > >>>>>
> > >>>>> --Matt
> > >>>>>
> > >>>>> [1]:
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539
> > >>>>>
> > >>>>> On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <[email protected]>
> > >>>> wrote:
> > >>>>>
> > >>>>>> Thank you for your comments.
> > >>>>>>
> > >>>>>> I spent some time trying to confirm definitively that this
> proposal
> > >>>> would
> > >>>>>> enable zero copy sharing both ways between pyarrow and numpy. I
> put
> > >>>>>> together the following gist [1] with my experiment.
> > >>>>>>
> > >>>>>> To summarize the results:
> > >>>>>> - I was able to share the underlying value buffer both ways and
> > >> have
> > >>> it
> > >>>>> be
> > >>>>>> interpreted correctly in each case.
> > >>>>>> - Numpy will write 0 or 1 to the value buffer to indicate False or
> > >>>> True.
> > >>>>>> Importantly, numpy will also understand values outside this range
> > >> to
> > >>>> mean
> > >>>>>> True without requiring a copy. This tracks closely with the
> > >> proposed
> > >>>>>> semantics.
> > >>>>>>
> > >>>>>> [1]:
> > >>> https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b
> > >>>>>>
> > >>>>>> On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <[email protected]>
> > >>> wrote:
> > >>>>>>
> > >>>>>>>>> Before the vote, I would like to see verification that this
> > >>> truly
> > >>>>>>> enables
> > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python.
> > >>>>>>>
> > >>>>>>>> I think this is an implementation issue more than a
> > >> specification
> > >>>>>>> issue...I am not personally worried about any provisions on the
> > >>>>>>> specification that might make this impossible.
> > >>>>>>>
> > >>>>>>> To clarify, what I am looking for here is definite confirmation
> > >>> that
> > >>>>>>> the proposed representation (in which a signed int8 zero value
> > >>>>> indicates
> > >>>>>>> False and any non-zero signed int8 value indicates True)
> > >>> corresponds
> > >>>> to
> > >>>>>> the
> > >>>>>>> representation used by NumPy such that bidirectional zero-copy is
> > >>>> made
> > >>>>>>> possible. This seems to me like a specification issue.
> > >>>>>>>
> > >>>>>>> Ian
> > >>>>>>>
> > >>>>>>> On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington
> > >>>>>>> <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>>> Thank you for this! I have definitely run across the
> > >>>>> one-byte-per-item
> > >>>>>>>> bool in numpy, DuckDB, and cudf. I haven't heard any discussion
> > >>>> about
> > >>>>>>>> DuckDB here but I am fairly sure that they represent their
> > >>> boolean
> > >>>>>>>> type as an int8 as well [1].
> > >>>>>>>>
> > >>>>>>>>> Before the vote, I would like to see verification that this
> > >>> truly
> > >>>>>>> enables
> > >>>>>>>>> zero-copy to/from NumPy bool arrays in Python.
> > >>>>>>>>
> > >>>>>>>> I think this is an implementation issue more than a
> > >> specification
> > >>>>>>>> issue...I am not personally worried about any provisions on the
> > >>>>>>>> specification that might make this impossible.
> > >>>>>>>>
> > >>>>>>>> -dewey
> > >>>>>>>>
> > >>>>>>>> [1]
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37
> > >>>>>>>>
> > >>>>>>>> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <
> > >>>> [email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Hi Joel,
> > >>>>>>>>>
> > >>>>>>>>> This looks good to me on the principle. Can you split the
> > >> spec
> > >>>> and
> > >>>>>> the
> > >>>>>>>>> implementation(s) into separate PRs?
> > >>>>>>>>>
> > >>>>>>>>> Regards
> > >>>>>>>>>
> > >>>>>>>>> Antoine.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit :
> > >>>>>>>>>> Hi Arrow devs,
> > >>>>>>>>>>
> > >>>>>>>>>> I'm working on adding an extension type for 8-bit booleans,
> > >>> and
> > >>>>>>> wanted
> > >>>>>>>> to
> > >>>>>>>>>> start a discussion about it here because it could be
> > >> valuable
> > >>>> to
> > >>>>>>>> others if
> > >>>>>>>>>> adopted as a canonical extension type.
> > >>>>>>>>>>
> > >>>>>>>>>> The native implementation of the Boolean type uses 1 bit to
> > >>>>> encode
> > >>>>>>> each
> > >>>>>>>>>> value, enabling a very compact representation. This is
> > >>>> favorable
> > >>>>>> for
> > >>>>>>>> many
> > >>>>>>>>>> workloads, but lots of systems that want to produce/consume
> > >>>>> Boolean
> > >>>>>>>> arrays
> > >>>>>>>>>> use an 8-bit representation internally and are forced to
> > >>>>>> copy/convert
> > >>>>>>>> at
> > >>>>>>>>>> their periphery. For these scenarios where zero-copy
> > >>>>> compatibility
> > >>>>>> is
> > >>>>>>>>>> important, the 8-bit representation of boolean values may
> > >> be
> > >>>>>>> preferred.
> > >>>>>>>>>> This can benefit interactions with existing libraries that
> > >>>> avoid
> > >>>>>>>> packing
> > >>>>>>>>>> column data like 1-bit booleans for parallelization
> > >> purposes,
> > >>>>>>>> including GPU
> > >>>>>>>>>> libraries such as libcudf. The original issue [1]
> > >> identifies
> > >>>>> numpy
> > >>>>>>>>>> conversion as a specific use-case as well.
> > >>>>>>>>>>
> > >>>>>>>>>> The details of the extension type can be found in the draft
> > >>> PR
> > >>>>> [2]
> > >>>>>>>> which
> > >>>>>>>>>> contains a Go implementation (WIP) and an update to the
> > >>>>>> documentation
> > >>>>>>>> for
> > >>>>>>>>>> canonical extension types. I plan to add a C++
> > >> implementation
> > >>>> as
> > >>>>>> well
> > >>>>>>>> but
> > >>>>>>>>>> wanted to open this discussion first.
> > >>>>>>>>>>
> > >>>>>>>>>> A quick overview of the layout / semantics proposed in the
> > >>> PR:
> > >>>>>>>>>> Storage Type: Int8
> > >>>>>>>>>> Value Semantics: 0 == false, any non-zero value is true
> > >>>>>>>>>>
> > >>>>>>>>>> I'd appreciate any feedback here or on the PR. If this all
> > >>>> seems
> > >>>>>>>> reasonable
> > >>>>>>>>>> then I'll move forward with the next implementation and
> > >> open
> > >>> up
> > >>>>>>> another
> > >>>>>>>>>> proposal for a formal vote. Thanks!
> > >>>>>>>>>>
> > >>>>>>>>>> [1]: https://github.com/apache/arrow/issues/17682
> > >>>>>>>>>> [2]: https://github.com/apache/arrow/pull/43234
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >
>

Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

Reply via email to