Re: [DISCUSS] 8-bit Boolean Canonical Extension Type

Ian Cook Wed, 17 Jul 2024 07:17:44 -0700

>> Before the vote, I would like to see verification that this truly enables
>> zero-copy to/from NumPy bool arrays in Python.


> I think this is an implementation issue more than a specification
> issue...I am not personally worried about any provisions on the
> specification that might make this impossible.

To clarify, what I am looking for here is definite confirmation that
the proposed representation (in which a signed int8 zero value indicates
False and any non-zero signed int8 value indicates True) corresponds to the
representation used by NumPy such that bidirectional zero-copy is made
possible. This seems to me like a specification issue.
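
A minimal sketch of the kind of check being asked for, using NumPy alone.
It assumes nothing about the API of the proposed extension type (no Arrow
calls appear here), and the array contents are made up for illustration.
It shows that a NumPy bool array can be reinterpreted as int8 without a
copy, and that the reverse direction has two options that differ in
whether they copy:

```python
import numpy as np

# NumPy stores each bool element in one byte: True as 1, False as 0.
flags = np.array([True, False, True, True], dtype=np.bool_)
assert flags.dtype.itemsize == 1

# NumPy bool -> int8: reinterpret the same buffer, no copy.
as_int8 = flags.view(np.int8)
assert np.shares_memory(flags, as_int8)
assert as_int8.tolist() == [1, 0, 1, 1]

# int8 -> NumPy bool: the proposed semantics allow any non-zero value
# to mean True, so the storage may hold bytes other than 0 and 1.
storage = np.array([0, 1, -1, 2], dtype=np.int8)

# Option A: reinterpret in place. Zero-copy, but the bytes are not
# normalized to 0/1. This sketch does not rely on how NumPy treats
# such values afterwards.
reinterpreted = storage.view(np.bool_)
assert np.shares_memory(storage, reinterpreted)

# Option B: compare against zero. Well-defined, but allocates a new buffer.
converted = storage != 0
assert not np.shares_memory(storage, converted)
assert converted.tolist() == [False, True, True, True]
```

The NumPy-to-int8 direction is straightforward. The open question is
whether int8 storage holding values other than 0 and 1 can be handed to
NumPy as bool without the normalizing copy in Option B.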

Ian

On Wed, Jul 17, 2024 at 9:39 AM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:

> Thank you for this! I have definitely run across the one-byte-per-item
> bool in numpy, DuckDB, and cudf. I haven't heard any discussion about
> DuckDB here but I am fairly sure that they represent their boolean
> type as an int8 as well [1].
>
> > Before the vote, I would like to see verification that this truly enables
> > zero-copy to/from NumPy bool arrays in Python.
>
> I think this is an implementation issue more than a specification
> issue...I am not personally worried about any provisions on the
> specification that might make this impossible.
>
> -dewey
>
> [1]
> https://github.com/duckdb/duckdb/blob/85a82d86aa11a2695fc045deaf4f88fc63dd4fec/src/common/arrow/appender/bool_data.cpp#L28-L37
>
> On Tue, Jul 16, 2024 at 11:25 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >
> > Hi Joel,
> >
> > This looks good to me in principle. Can you split the spec and the
> > implementation(s) into separate PRs?
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 16/07/2024 à 13:18, Joel Lubinitsky a écrit :
> > > Hi Arrow devs,
> > >
> > > I'm working on adding an extension type for 8-bit booleans, and wanted to
> > > start a discussion about it here because it could be valuable to others if
> > > adopted as a canonical extension type.
> > >
> > > The native implementation of the Boolean type uses 1 bit to encode each
> > > value, enabling a very compact representation. This is favorable for many
> > > workloads, but lots of systems that want to produce/consume Boolean arrays
> > > use an 8-bit representation internally and are forced to copy/convert at
> > > their periphery. For these scenarios where zero-copy compatibility is
> > > important, the 8-bit representation of boolean values may be preferred.
> > > This can benefit interactions with existing libraries that avoid packing
> > > column data like 1-bit booleans for parallelization purposes, including GPU
> > > libraries such as libcudf. The original issue [1] identifies numpy
> > > conversion as a specific use-case as well.
> > >
> > > The details of the extension type can be found in the draft PR [2] which
> > > contains a Go implementation (WIP) and an update to the documentation for
> > > canonical extension types. I plan to add a C++ implementation as well but
> > > wanted to open this discussion first.
> > >
> > > A quick overview of the layout / semantics proposed in the PR:
> > > Storage Type: Int8
> > > Value Semantics: 0 == false, any non-zero value is true
> > >
> > > I'd appreciate any feedback here or on the PR. If this all seems reasonable
> > > then I'll move forward with the next implementation and open up another
> > > proposal for a formal vote. Thanks!
> > >
> > > [1]: https://github.com/apache/arrow/issues/17682
> > > [2]: https://github.com/apache/arrow/pull/43234
> > >
>
