Re: [Discuss][Python] Numba support for PyArrow

Alenka Frim Tue, 31 Mar 2026 22:59:51 -0700

Hello,

I don't think I am the right person to respond due to being generally
overoptimistic and uncritical of any new idea =) That being said, I think
adding basic Arrow array support in Numba feels like a very interesting
thing to do and would not be against adding the code to PyArrow (mainly
because I do not think this would be abandoned since it is a suggestion
from you, Antoine).


Might Numba support attract interest from the community to help us find and
fund more active PyArrow maintainers?

All best,
Alenka

On Tue, 31 Mar 2026 at 09:02, Antoine Pitrou <[email protected]> wrote:

>
> Vignesh,
>
> Can you stop posting AI-generated messages? This is bringing zero value
> to the discussion.
>
> Antoine.
>
>
> On 31/03/2026 at 08:59, Vignesh Siva wrote:
> > Hi All,
> >
> > Thank you for outlining the detailed considerations and the proposed path
> > forward regarding Numba integration with PyArrow. The challenge posed by
> > the unstable Arrow C++ ABI is certainly a significant hurdle.
> >
> > The concept of exposing a private, Numba-specific ABI within PyArrow,
> > while offloading the more extensive Numba functionality to a separate
> > project,
> > presents an interesting compromise. This approach seems to address the
> > immediate need for performance and direct API access without necessarily
> > burdening PyArrow with the full scope of Numba development. However, I
> > share the concerns about the potential for increased maintenance overhead
> > within PyArrow and the initial cumbersome development workflow you
> > mentioned.
> >
> > I believe this solution warrants careful consideration, particularly in
> > terms of its long-term implications for PyArrow's architecture and the
> > maintainability of the exposed ABI. It would be valuable to gather
> > further community feedback on whether this internal exposure of a
> > Numba-specific ABI aligns with the project's broader strategy and how we
> > envision managing potential versioning and stability issues. What are
> > your thoughts on
> > mitigating the initial development complexities and ensuring clear
> > boundaries between the core PyArrow and the Numba integration layer?
> >
> > Regards,
> > Vignesh
> >
> > On Mon, 30 Mar 2026 at 20:17, Dewey Dunnington <[email protected]> wrote:
> >
> >> Hi Antoine,
> >>
> >> Perhaps this could be done by a python-level dependency? (e.g., wrap
> >> input with pyarrow.array(); wrap output in pyarrow.array() to keep
> >> using pyarrow objects for input/output). I would be surprised if the
> >> overhead of ChunkedArray -> ArrowArrayStream is meaningful here but
> >> I'm also not plugged into the details so feel free to ignore me :).
> >> This is also only overhead for pyarrow input (polars/arro3/other
> >> non-pyarrow input would have to do this anyway).
> >>
> >> (I'm not opposed to adding this to pyarrow...it's just sometimes
> >> floated that it's tricky to attract pyarrow maintainers and this
> >> particular bit of extra code to maintain/keep up-to-date with another
> >> third-party library seems avoidable)
> >>
> >> Cheers,
> >>
> >> -dewey
> >>
> >> On Mon, Mar 30, 2026 at 3:28 AM Antoine Pitrou <[email protected]> wrote:
> >>>
> >>>
> >>> Hi Dewey,
> >>>
> >>> Yes, I thought about this possibility. I think only nanoarrow would
> >>> work, as it provides options for private namespacing and therefore
> >>> prevents ABI issues. However, the main concern is what happens at the
> >>> Python-Numba boundary.
> >>>
> >>> Specifically:
> >>> - how are Python objects (such as PyArrow arrays) unwrapped into native
> >>> arrays, when a Numba function is entered
> >>> - how are native arrays wrapped as Python objects at the return path
> >>>
> >>> Then we need to consider the different types of objects:
> >>>
> >>> - arrays or record batches could go through the PyCapsule-based
> >>> protocol, but it needs to be implemented. Such an implementation
> >>> already exists in PyArrow and nanoarrow-python, which is an argument
> >>> for basing off of this existing work.
> >>>
> >>> - chunked arrays or tables could similarly go through the C Stream
> >>> Interface, but that's a complicated (and slightly costly) indirection
> >>>
> >>> - and that's not talking about ancillary types such as scalars, which
> >>> we probably want to support
> >>>
> >>> A possible solution for all this would be to expose a small, private,
> >>> Numba-specific ABI (*) in PyArrow and implement the bulk of the
> >>> functionality as a separate project. For the earlier phases of
> >>> development, though, that would make development more cumbersome as we
> >>> would need to iterate on both projects in lockstep (and have the
> >>> separate project depend on unreleased versions of PyArrow).
> >>>
> >>>
> >>> (*) As a matter of fact, my current proof of concept relies on a small
> >>> set of C functions for which we can emit calls from Numba codegen (LLVM
> >>> doesn't know about C++ or C, only about the platform ABI):
> >>>
> >>>
> >>> https://github.com/apache/arrow/compare/main...pitrou:numba-exp#diff-a4b57ffdf0d6ab28e26bf1e4985b18669636f0ccb3af592c0dbd5789cb2ebef8
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>> On 30/03/2026 at 04:05, Dewey Dunnington wrote:
> >>>> Cool!
> >>>>
> >>>> Just wondering: if you used arrow-rs, sparrow, or nanoarrow, could you
> >>>> develop this as a separate project without adding another component to
> >>>> PyArrow?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -dewey
> >>>>
> >>>> On Fri, Mar 27, 2026 at 3:29 AM Raúl Cumplido <[email protected]> wrote:
> >>>>>
> >>>>> Hi Vignesh
> >>>>>
> >>>>> As for the release schedule concerns, that argument doesn't hold up.
> >>>>> Otherwise we would have to tie our releases to NumPy, pandas or
> >>>>> others.
> >>>>>
> >>>>> It is just business as usual to test against a set of versions on our
> >>>>> CI and keep our releases independent from any third party.
> >>>>>
> >>>>> Obviously any new feature to the project has a maintenance burden
> >>>>> associated with it but I am unsure about the "potential dilution of
> >>>>> pyarrow's core focus as a universal columnar data layer". Enabling
> >>>>> better support and integrations with the Python scientific computing
> >>>>> ecosystem has been part of the scope of the project.
> >>>>>
> >>>>> And as Antoine mentioned, the integration needs C++ internals without
> >>>>> a stable ABI, which makes an external package fragile. That's, as far
> >>>>> as I understand it, the same reason our pandas/NumPy integration
> >>>>> lives in PyArrow.
> >>>>>
> >>>>> Regards,
> >>>>> Raúl
> >>>>>
> >>>>> On Fri, 27 Mar 2026 at 4:28, Vignesh Siva
> >>>>> (<[email protected]>) wrote:
> >>>>>>
> >>>>>> Thanks, Li Jin,
> >>>>>>
> >>>>>> While integrating the Numba layer directly into PyArrow offers
> >>>>>> benefits like potentially simpler user experience and direct access
> >>>>>> to C++ internals without ABI concerns, there are several potential
> >>>>>> downsides from the perspective of PyArrow's core development and
> >>>>>> project management. Firstly, it would significantly increase the
> >>>>>> maintenance burden on the PyArrow development team. This includes not
> >>>>>> only supporting the Numba integration code itself but also ensuring
> >>>>>> its compatibility with future Numba and Arrow releases and debugging
> >>>>>> issues specific to this integration. This could divert resources from
> >>>>>> PyArrow's core mission and broader development.
> >>>>>>
> >>>>>> Secondly, it could lead to an expansion of PyArrow's scope and a
> >>>>>> potential dilution of its core focus as a universal columnar data
> >>>>>> layer. Adding highly specialized integrations, even optional ones,
> >>>>>> can make the project larger and more complex for new contributors to
> >>>>>> navigate. It also ties the release cycles of Numba-specific features
> >>>>>> to PyArrow's release schedule, which might not always align. An
> >>>>>> external package, while facing ABI challenges, allows for more agile
> >>>>>> development, independent release cycles, and a dedicated community
> >>>>>> focused solely on the Numba-PyArrow interface, without adding
> >>>>>> overhead to the main PyArrow project.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Vignesh
> >>>>>>
> >>>>>> On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi Antoine,
> >>>>>>>
> >>>>>>> This is exciting work. I am generally in favor of putting this
> >>>>>>> inside PyArrow for ease of use and the ABI reasons above. Can you
> >>>>>>> explain a bit more about the downsides of putting it in PyArrow vs.
> >>>>>>> a separate package?
> >>>>>>>
> >>>>>>> Li
> >>>>>>>
> >>>>>>> On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> Numba (https://numba.pydata.org/) is a Just-in-Time compiler for
> >>>>>>>> Python that speeds up scientific calculations written in Python.
> >>>>>>>> Out of the box, Numba supports NumPy arrays (which were the primary
> >>>>>>>> target for its design).
> >>>>>>>>
> >>>>>>>> We (at QuantStack) have been investigating the feasibility of
> >>>>>>>> supporting a subset of PyArrow in Numba, so that the fast
> >>>>>>>> computation abilities of Numba can extend to data in the Arrow
> >>>>>>>> format.
> >>>>>>>>
> >>>>>>>> We have come to the conclusion that supporting a small subset of
> >>>>>>>> PyArrow is definitely doable, at a competitive performance level
> >>>>>>>> (between "as fast as C++" and "4x slower" on a couple preliminary
> >>>>>>>> micro-benchmarks).
> >>>>>>>>
> >>>>>>>> (by "small subset" we mostly mean: primitive data types, reading
> >>>>>>>> and building arrays)
> >>>>>>>>
> >>>>>>>> The Numba integration layer would ideally have to be maintained
> >>>>>>>> and distributed within PyArrow, because of the need to access a
> >>>>>>>> number of Arrow C++ APIs, which don't have a stable ABI (it *might*
> >>>>>>>> be possible to work around this by exporting a dedicated C-like ABI
> >>>>>>>> from PyArrow, though).
> >>>>>>>>
> >>>>>>>> What we would like to know is how the community feels about
> >>>>>>>> putting this code inside PyArrow, rather than a separate package,
> >>>>>>>> for the reason given above.
> >>>>>>>>
> >>>>>>>> This would *not* add a dependency on Numba, since this can be
> >>>>>>>> exposed as a dynamically-loaded extension point:
> >>>>>>>> https://numba.readthedocs.io/en/stable/extending/entrypoints.html
> >>>>>>>>
> >>>>>>>> (note: this preliminary investigation was supported by one of our
> >>>>>>>> fine customers)
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Antoine.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>
> >>
> >
>
>
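
For readers following the thread: the split Antoine describes between
arrays/record batches (PyCapsule-based protocol) and chunked arrays/tables
(C Stream Interface) can be sketched at the Python level roughly as below.
This is only an illustration of the dispatch, assuming the producer object
follows the Arrow PyCapsule interface dunder methods; `export_arrow` is a
hypothetical helper name, not an existing PyArrow or Numba function, and it
says nothing about how the proof of concept actually unwraps objects.

```python
def export_arrow(obj):
    """Export `obj` through the Arrow PyCapsule interface (illustrative only).

    Returns a (kind, capsules) pair:
      - ("array", (schema_capsule, array_capsule)) for objects exposing
        __arrow_c_array__ (e.g. arrays and record batches)
      - ("stream", (stream_capsule,)) for objects exposing only
        __arrow_c_stream__ (e.g. chunked arrays and tables)
    """
    # Prefer the single-array path: it avoids the stream indirection.
    if hasattr(obj, "__arrow_c_array__"):
        schema_capsule, array_capsule = obj.__arrow_c_array__()
        return "array", (schema_capsule, array_capsule)

    # Fall back to the stream path for chunked / tabular producers.
    if hasattr(obj, "__arrow_c_stream__"):
        return "stream", (obj.__arrow_c_stream__(),)

    raise TypeError(
        f"{type(obj).__name__} does not implement the Arrow PyCapsule interface"
    )
```

With a recent PyArrow installed, passing a `pyarrow.Array` should take the
"array" path and a `pyarrow.ChunkedArray` the "stream" path; a consumer would
then still have to unpack the capsules in native code, which is the part the
thread is actually debating.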

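The "dynamically-loaded extension point" mentioned in the original proposal
relies on Python packaging entry points: per the Numba documentation linked
in the thread, Numba looks for an entry point named `init` in the
`numba_extensions` group. The sketch below only lists such entry points using
the standard library, without importing them; the helper names are
hypothetical and this is not Numba's own discovery code.

```python
from importlib.metadata import entry_points

# Group and entry-point name documented by Numba for extensions.
NUMBA_GROUP = "numba_extensions"
NUMBA_INIT_NAME = "init"


def find_numba_extension_inits():
    """Return installed entry points registered under Numba's extension group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points API.
        selected = eps.select(group=NUMBA_GROUP)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
        selected = eps.get(NUMBA_GROUP, [])
    return [ep for ep in selected if ep.name == NUMBA_INIT_NAME]


def describe(eps):
    """Format entry points as 'name = module:attr' strings without loading them."""
    return [f"{ep.name} = {ep.value}" for ep in eps]


if __name__ == "__main__":
    for line in describe(find_numba_extension_inits()):
        print(line)
```

Because the registration lives in package metadata, a package can declare
such an entry point without importing Numba at install or import time, which
is why the proposal would not make Numba a dependency of PyArrow.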