Re: [Discuss][Python] Numba support for PyArrow

Antoine Pitrou Tue, 31 Mar 2026 00:02:50 -0700


Vignesh,

Can you stop posting AI-generated messages? This is bringing zero value to the discussion.


Antoine.


On 31/03/2026 at 08:59, Vignesh Siva wrote:

Hi All,

Thank you for outlining the detailed considerations and the proposed path
forward regarding Numba integration with PyArrow. The challenge posed by
the unstable Arrow C++ ABI is certainly a significant hurdle.

The concept of exposing a private, Numba-specific ABI within PyArrow, while
offloading the more extensive Numba functionality to a separate project,
presents an interesting compromise. This approach seems to address the
immediate need for performance and direct API access without necessarily
burdening PyArrow with the full scope of Numba development. However, I
share the concerns about the potential for increased maintenance overhead
within PyArrow and the initial cumbersome development workflow you
mentioned.

I believe this solution warrants careful consideration, particularly in
terms of its long-term implications for PyArrow's architecture and the
maintainability of the exposed ABI. It would be valuable to gather further
community feedback on whether this internal exposure of a Numba-specific
ABI aligns with the project's broader strategy and how we envision managing
potential versioning and stability issues. What are your thoughts on
mitigating the initial development complexities and ensuring clear
boundaries between the core PyArrow and the Numba integration layer?

Regards,
Vignesh

On Mon, 30 Mar 2026 at 20:17, Dewey Dunnington <[email protected]>
wrote:

Hi Antoine,

Perhaps this could be done by a python-level dependency? (e.g., wrap
input with pyarrow.array(); wrap output in pyarrow.array() to keep
using pyarrow objects for input/output). I would be surprised if the
overhead of ChunkedArray -> ArrowArrayStream is meaningful here but
I'm also not plugged into the details so feel free to ignore me :).
This is also only overhead for pyarrow input (polars/arro3/other
non-pyarrow input would have to do this anyway).

(I'm not opposed to adding this to pyarrow...it's just sometimes
floated that it's tricky to attract pyarrow maintainers and this
particular bit of extra code to maintain/keep up-to-date with another
third-party library seems avoidable)

Cheers,

-dewey

On Mon, Mar 30, 2026 at 3:28 AM Antoine Pitrou <[email protected]> wrote:



Hi Dewey,

Yes, I thought about this possibility. I think only nanoarrow would work
as it provides options for private namespacing and therefore preventing
ABI issues. However, the main concern is what happens at the Python -
Numba boundary.

Specifically:
- how are Python objects (such as PyArrow arrays) unwrapped into native
arrays, when a Numba function is entered
- how are native arrays wrapped as Python objects at the return path

Then we need to consider the different types of objects:

- arrays or record batches could go through the PyCapsule-based
protocol, but it needs to be implemented. Such an implementation already
exists in PyArrow and nanoarrow-python, which is an argument for basing
off of this existing work.

- chunked arrays or tables could similarly go through the C Stream
Interface, but that's a complicated (and slightly costly) indirection

- and that's not talking about ancillary types such as scalars, which we
probably want to support

A possible solution for all this would be to expose a small, private,
Numba-specific ABI (*) in PyArrow and implement the bulk of the
functionality as a separate project. For the earlier phases of
development, though, that would make development more cumbersome as we
would need to iterate on both projects in lockstep (and have the
separate project depend on unreleased versions of PyArrow).


(*) As a matter of fact, my current proof of concept relies on a small
set of C functions for which we can emit calls from Numba codegen (LLVM
doesn't know about C++ or C, only about the platform ABI):

https://github.com/apache/arrow/compare/main...pitrou:numba-exp#diff-a4b57ffdf0d6ab28e26bf1e4985b18669636f0ccb3af592c0dbd5789cb2ebef8
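
To illustrate the parenthetical above: a caller that knows only an
address and a C signature can invoke a function without knowing which
language produced it. The sketch below uses ctypes from the standard
library rather than Numba or LLVM, and `add_i64` is a hypothetical
function, not one of those in the proof of concept.

```python
import ctypes

# C signature: int64_t add_i64(int64_t, int64_t)
ADD_I64 = ctypes.CFUNCTYPE(ctypes.c_int64, ctypes.c_int64, ctypes.c_int64)


def _add_i64(a, b):
    return a + b


# Exporting side: a real C-callable function pointer (kept alive by this name).
exported = ADD_I64(_add_i64)
address = ctypes.cast(exported, ctypes.c_void_p).value

# Calling side: only the raw address and the C signature are known.
call_add = ADD_I64(address)
total = call_add(40, 2)
print(total)
```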


Regards

Antoine.



On 30/03/2026 at 04:05, Dewey Dunnington wrote:

Cool!

Just wondering: if you used arrow-rs, sparrow, or nanoarrow, could you
develop this as a separate project without adding another component to
PyArrow?

Cheers,

-dewey

On Fri, Mar 27, 2026 at 3:29 AM Raúl Cumplido <[email protected]> wrote:


Hi Vignesh

As for the release schedule concerns, that argument doesn't hold up.
Otherwise we would have to tie our releases to NumPy, pandas or
others.

It is just business as usual to test against a set of versions on our
CI and keep our releases independent from any third party.

Obviously any new feature to the project has a maintenance burden
associated with it but I am unsure about the "potential dilution of
pyarrow's core focus as a universal columnar data layer". Enabling
better support and integrations with the Python scientific computing
ecosystem has been part of the scope of the project.

And as Antoine mentioned, the integration needs C++ internals without
a stable ABI, which makes an external package fragile. That's, as far
as I understand it, the same reason our pandas/NumPy integration lives
in PyArrow.

Regards,
Raúl

On Fri, 27 Mar 2026 at 4:28, Vignesh Siva
(<[email protected]>) wrote:


Thanks, Li Jin,

While integrating the Numba layer directly into PyArrow offers benefits
like potentially simpler user experience and direct access to C++
internals without ABI concerns, there are several potential downsides
from the perspective of PyArrow's core development and project
management. Firstly, it would significantly increase the maintenance
burden on the PyArrow development team. This includes not only
supporting the Numba integration code itself but also ensuring its
compatibility with future Numba and Arrow releases and debugging issues
specific to this integration. This could divert resources from PyArrow's
core mission and broader development.

Secondly, it could lead to an expansion of PyArrow's scope and a
potential dilution of its core focus as a universal columnar data layer.
Adding highly specialized integrations, even optional ones, can make the
project larger and more complex for new contributors to navigate. It
also ties the release cycles of Numba-specific features to PyArrow's
release schedule, which might not always align. An external package,
while facing ABI challenges, allows for more agile development,
independent release cycles, and a dedicated community focused solely on
the Numba-PyArrow interface, without adding overhead to the main PyArrow
project.

Regards,
Vignesh

On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:

Hi Antoine,

This is exciting work. I am generally in favor of putting it inside
PyArrow for the ease-of-use and ABI reasons above. Can you explain a bit
more what the downsides are of putting it in PyArrow vs a separate
package?

Li

On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]>
wrote:


Hello,

Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python
that makes it possible to speed up scientific calculations written in
Python. Out of the box, Numba supports NumPy arrays (which were the
primary target for its design).

We (at QuantStack) have been investigating the feasibility of supporting
a subset of PyArrow in Numba, so that the fast computation abilities of
Numba can extend to data in the Arrow format.

We have come to the conclusion that supporting a small subset of PyArrow
is definitely doable, at a competitive performance level (between "as
fast as C++" and "4x slower" on a couple of preliminary
micro-benchmarks).

(by "small subset" we mostly mean: primitive data types, reading and
building arrays)

The Numba integration layer would ideally have to be maintained and
distributed within PyArrow, because of the need to access a number of
Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
work around this by exporting a dedicated C-like ABI from PyArrow,
though).

What we would like to know is how the community feels about putting this
code inside PyArrow, rather than a separate package, for the reason
given above.

This would *not* add a dependency on Numba, since this can be exposed as
a dynamically-loaded extension point:
https://numba.readthedocs.io/en/stable/extending/entrypoints.html
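
As a rough illustration of that mechanism: per the Numba documentation
linked above, a package registers an entry point named `init` in the
`numba_extensions` group, pointing at a `module:function` initializer
that Numba loads on demand. The sketch below only lists what is
registered in the current environment, using the standard library; the
helper name `numba_extension_entry_points` is made up for this example.

```python
from importlib import metadata


def numba_extension_entry_points():
    """Return the entry points registered under the "numba_extensions" group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        return list(eps.select(group="numba_extensions"))
    return list(eps.get("numba_extensions", []))  # older dict-style API


for ep in numba_extension_entry_points():
    # Each entry maps a name to a "module:function" initializer.
    print(ep.name, "->", ep.value)
```

Nothing here imports Numba, which is the point of the extension
mechanism: the registering package does not need Numba installed.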

(note: this preliminary investigation was supported by one of our fine
customers)

Regards

Antoine.
