I think it would confuse implementors of the spec and people implementing
kernels way too much. “the bool Arrow type” should probably not start
meaning two different things.

—
Felipe

On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <emkornfi...@gmail.com> wrote:

> As Boolean is already in the arrow type system I think it might be worth
> asking the question as to whether this should be an extension type or a
> first class type.
>
> Given what I think of the  last discussion on the trade-offs [1], I think
> there is room for debate here, since Boolean is not currently
> parameterized, adding it as an existing type would require a new top level
> type.
>
> Thanks,
> Micah
>
> [1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t
>
> On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <frim.ale...@gmail.com> wrote:
>
> > Thank you Joel for working on this! I have also came across
> > the need for a byte packed boolean support when implementing the
> > Python dataframe interchange protocol and also DPack which
> > is implemented in Arrow C++. The extension type is a great solution.
> >
> > I will comment on the PR if I have any questions.
> >
> > Alenka
> >
> > V V sre., 17. jul. 2024 ob 23:32 je oseba Ian Cook <ianmc...@apache.org>
> > napisala:
> >
> > > Thanks Joel and Matt. This looks good to me.
> > >
> > > I think it's worth saying here that Arrow-producing components should
> > still
> > > by default emit Booleans in the standard bit-packed Arrow layout. This
> > > proposed bool8 canonical extension type is intended to be used in
> > > applications where the producer knows that the consumer can correctly
> > > interpret the bool8 extension type and where using it is more efficient
> > > than converting the data to the standard bit-packed layout.
> > >
> > > Ian
> > >
> > > On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com>
> > wrote:
> > >
> > > > Just chiming in that the libcudf documentation[1] states that this
> > > proposal
> > > > should work just fine. Bool8 type is described as "0 == false, else
> > > true".
> > > >
> > > > --Matt
> > > >
> > > > [1]:
> > > >
> > > >
> > >
> >
> https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539
> > > >
> > > > On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com>
> > > wrote:
> > > >
> > > > > Thank you for your comments.
> > > > >
> > > > > I spent some time trying to confirm definitively that this proposal
> > > would
> > > > > enable zero copy sharing both ways between pyarrow and numpy. I put
> > > > > together the following gist [1] with my experiment.
> > > > >
> > > > > To summarize the results:
> > > > > - I was able to share the underlying value buffer both ways and
> have
> > it
> > > > be
> > > > > interpreted correctly in each case.
> > > > > - Numpy will write 0 or 1 to the value buffer to indicate False or
> > > True.
> > > > > Importantly, numpy will also understand values outside this range
> to
> > > mean
> > > > > True without requiring a copy. This tracks closely with the
> proposed
> > > > > semantics.
> > > > >
> > > > > [1]:
> > https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b
> > > > >
> > > > > On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org>
> > wrote:
> > > > >
> > > > > > >> Before the vote, I would like to see verification that this
> > truly
> > > > > > enables
> > > > > > >> zero-copy to/from NumPy bool arrays in Python.
> > > > > >
> > > > > > > I think this is an implementation issue more than a
> specification
> > > > > > issue...I am not personally worried about any provisions on the
> > > > > > specification that might make this impossible.
> > > > > >
> > > > > > To clarify, what I am looking for here is definite confirmation
> > that
> > > > > > the proposed representation (in which a signed int8 zero value
> > > > indicates
> > > > > > False and any non-zero signed int8 value indicates True)
> > corresponds
> > > to
> > > > > the
> > > > > > representation used by NumPy such that bidirectional zero-copy is
> > > made
> > > > > > possible. This seems to me like a specification issue.
> > > > > >
> > > > > > Ian
> > > > > >
> > > > > > On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington
> > > > > > <de...@voltrondata.com.invalid> wrote:
> > > > > >
> > > > > > > Thank you for this! I have definitely run across the
> > > > one-byte-per-item
> > > > > > > bool in numpy, DuckDB, and cudf. I haven't heard any discussion
> > > about
> > > > > > > DuckDB here but I am fairly sure that they represent their
> > boolean
> > > > > > > type as an int8 as well [1].
> > > > > > >
> > > > > > > > Before the vote, I would like to see verification that this
> > truly
> > > > > > enables
> > > > > > > > zero-copy to/from NumPy bool arrays in Python.
> > > > > > >
> > > > > > > I think this is an implementation issue more than a
> specification
> > > > > > > issue...I am not personally worried about any provisions on the
> > > > > > > specification that might make this impossible.
> > > > > > >
> > > > > > > -dewey
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37
> > > > > > >
> > > > > > > On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <
> > > anto...@python.org>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Hi Joel,
> > > > > > > >
> > > > > > > > This looks good to me on the principle. Can you split the
> spec
> > > and
> > > > > the
> > > > > > > > implementation(s) into separate PRs?
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > > >
> > > > > > > >
> > > > > > > > Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit :
> > > > > > > > > Hi Arrow devs,
> > > > > > > > >
> > > > > > > > > I'm working on adding an extension type for 8-bit booleans,
> > and
> > > > > > wanted
> > > > > > > to
> > > > > > > > > start a discussion about it here because it could be
> valuable
> > > to
> > > > > > > others if
> > > > > > > > > adopted as a canonical extension type.
> > > > > > > > >
> > > > > > > > > The native implementation of the Boolean type uses 1 bit to
> > > > encode
> > > > > > each
> > > > > > > > > value, enabling a very compact representation. This is
> > > favorable
> > > > > for
> > > > > > > many
> > > > > > > > > workloads, but lots of systems that want to produce/consume
> > > > Boolean
> > > > > > > arrays
> > > > > > > > > use an 8-bit representation internally and are forced to
> > > > > copy/convert
> > > > > > > at
> > > > > > > > > their periphery. For these scenarios where zero-copy
> > > > compatibility
> > > > > is
> > > > > > > > > important, the 8-bit representation of boolean values may
> be
> > > > > > preferred.
> > > > > > > > > This can benefit interactions with existing libraries that
> > > avoid
> > > > > > > packing
> > > > > > > > > column data like 1-bit booleans for parallelization
> purposes,
> > > > > > > including GPU
> > > > > > > > > libraries such as libcudf. The original issue [1]
> identifies
> > > > numpy
> > > > > > > > > conversion as a specific use-case as well.
> > > > > > > > >
> > > > > > > > > The details of the extension type can be found in the draft
> > PR
> > > > [2]
> > > > > > > which
> > > > > > > > > contains a Go implementation (WIP) and an update to the
> > > > > documentation
> > > > > > > for
> > > > > > > > > canonical extension types. I plan to add a C++
> implementation
> > > as
> > > > > well
> > > > > > > but
> > > > > > > > > wanted to open this discussion first.
> > > > > > > > >
> > > > > > > > > A quick overview of the layout / semantics proposed in the
> > PR:
> > > > > > > > > Storage Type: Int8
> > > > > > > > > Value Semantics: 0 == false, any non-zero value is true
> > > > > > > > >
> > > > > > > > > I'd appreciate any feedback here or on the PR. If this all
> > > seems
> > > > > > > reasonable
> > > > > > > > > then I'll move forward with the next implementation and
> open
> > up
> > > > > > another
> > > > > > > > > proposal for a formal vote. Thanks!
> > > > > > > > >
> > > > > > > > > [1]: https://github.com/apache/arrow/issues/17682
> > > > > > > > > [2]: https://github.com/apache/arrow/pull/43234
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to