Agreed with Felipe. This is meant for communicating with non-Arrow type systems, but shouldn't be regarded as an alternative first-class boolean type.

Regards

Antoine.


Le 19/07/2024 à 06:30, Felipe Oliveira Carvalho a écrit :
I think it would confuse implementors of the spec and people implementing
kernels way too much. “the bool Arrow type” should probably not start
meaning two different things.

—
Felipe

On Fri, 19 Jul 2024 at 01:26 Micah Kornfield <emkornfi...@gmail.com> wrote:

As Boolean is already in the arrow type system I think it might be worth
asking the question as to whether this should be an extension type or a
first class type.

Given what I think of the  last discussion on the trade-offs [1], I think
there is room for debate here, since Boolean is not currently
parameterized, adding it as an existing type would require a new top level
type.

Thanks,
Micah

[1] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t

On Wed, Jul 17, 2024 at 9:44 PM Alenka Frim <frim.ale...@gmail.com> wrote:

Thank you Joel for working on this! I have also came across
the need for a byte packed boolean support when implementing the
Python dataframe interchange protocol and also DPack which
is implemented in Arrow C++. The extension type is a great solution.

I will comment on the PR if I have any questions.

Alenka

V V sre., 17. jul. 2024 ob 23:32 je oseba Ian Cook <ianmc...@apache.org>
napisala:

Thanks Joel and Matt. This looks good to me.

I think it's worth saying here that Arrow-producing components should
still
by default emit Booleans in the standard bit-packed Arrow layout. This
proposed bool8 canonical extension type is intended to be used in
applications where the producer knows that the consumer can correctly
interpret the bool8 extension type and where using it is more efficient
than converting the data to the standard bit-packed layout.

Ian

On Wed, Jul 17, 2024 at 5:19 PM Matt Topol <zotthewiz...@gmail.com>
wrote:

Just chiming in that the libcudf documentation[1] states that this
proposal
should work just fine. Bool8 type is described as "0 == false, else
true".

--Matt

[1]:




https://docs.rapids.ai/api/libcudf/stable/group__utility__types#gadf077607da617d1dadcc5417e2783539

On Wed, Jul 17, 2024, 3:18 PM Joel Lubinitsky <joell...@gmail.com>
wrote:

Thank you for your comments.

I spent some time trying to confirm definitively that this proposal
would
enable zero copy sharing both ways between pyarrow and numpy. I put
together the following gist [1] with my experiment.

To summarize the results:
- I was able to share the underlying value buffer both ways and
have
it
be
interpreted correctly in each case.
- Numpy will write 0 or 1 to the value buffer to indicate False or
True.
Importantly, numpy will also understand values outside this range
to
mean
True without requiring a copy. This tracks closely with the
proposed
semantics.

[1]:
https://gist.github.com/joellubi/2ddf626633b57839cfd5f32cd94a7f3b

On Wed, Jul 17, 2024 at 10:16 AM Ian Cook <ianmc...@apache.org>
wrote:

Before the vote, I would like to see verification that this
truly
enables
zero-copy to/from NumPy bool arrays in Python.

I think this is an implementation issue more than a
specification
issue...I am not personally worried about any provisions on the
specification that might make this impossible.

To clarify, what I am looking for here is definite confirmation
that
the proposed representation (in which a signed int8 zero value
indicates
False and any non-zero signed int8 value indicates True)
corresponds
to
the
representation used by NumPy such that bidirectional zero-copy is
made
possible. This seems to me like a specification issue.

Ian

On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:

Thank you for this! I have definitely run across the
one-byte-per-item
bool in numpy, DuckDB, and cudf. I haven't heard any discussion
about
DuckDB here but I am fairly sure that they represent their
boolean
type as an int8 as well [1].

Before the vote, I would like to see verification that this
truly
enables
zero-copy to/from NumPy bool arrays in Python.

I think this is an implementation issue more than a
specification
issue...I am not personally worried about any provisions on the
specification that might make this impossible.

-dewey

[1]






https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37

On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <
anto...@python.org>
wrote:


Hi Joel,

This looks good to me on the principle. Can you split the
spec
and
the
implementation(s) into separate PRs?

Regards

Antoine.


Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit :
Hi Arrow devs,

I'm working on adding an extension type for 8-bit booleans,
and
wanted
to
start a discussion about it here because it could be
valuable
to
others if
adopted as a canonical extension type.

The native implementation of the Boolean type uses 1 bit to
encode
each
value, enabling a very compact representation. This is
favorable
for
many
workloads, but lots of systems that want to produce/consume
Boolean
arrays
use an 8-bit representation internally and are forced to
copy/convert
at
their periphery. For these scenarios where zero-copy
compatibility
is
important, the 8-bit representation of boolean values may
be
preferred.
This can benefit interactions with existing libraries that
avoid
packing
column data like 1-bit booleans for parallelization
purposes,
including GPU
libraries such as libcudf. The original issue [1]
identifies
numpy
conversion as a specific use-case as well.

The details of the extension type can be found in the draft
PR
[2]
which
contains a Go implementation (WIP) and an update to the
documentation
for
canonical extension types. I plan to add a C++
implementation
as
well
but
wanted to open this discussion first.

A quick overview of the layout / semantics proposed in the
PR:
Storage Type: Int8
Value Semantics: 0 == false, any non-zero value is true

I'd appreciate any feedback here or on the PR. If this all
seems
reasonable
then I'll move forward with the next implementation and
open
up
another
proposal for a formal vote. Thanks!

[1]: https://github.com/apache/arrow/issues/17682
[2]: https://github.com/apache/arrow/pull/43234









Reply via email to