Thanks, Li Jin,

While integrating the Numba layer directly into PyArrow offers benefits
such as a potentially simpler user experience and direct access to C++
internals without ABI concerns, there are several potential downsides from
the perspective of PyArrow's core development and project management. Firstly,
it would significantly increase the maintenance burden on the PyArrow
development team. This includes not only supporting the Numba integration
code itself but also ensuring its compatibility with future Numba and Arrow
releases and debugging issues specific to this integration. This could
divert resources from PyArrow's core mission and broader development.

Secondly, it could lead to an expansion of PyArrow's scope and a potential
dilution of its core focus as a universal columnar data layer. Adding
highly specialized integrations, even optional ones, can make the project
larger and more complex for new contributors to navigate. It also ties the
release cycles of Numba-specific features to PyArrow's release schedule,
which might not always align. An external package, while facing ABI
challenges, allows for more agile development, independent release cycles,
and a dedicated community focused solely on the Numba-PyArrow interface,
without adding overhead to the main PyArrow project.
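
For what it's worth, the entry-point mechanism Antoine linked to is not tied
to where the code lives: an external package can register itself in the
"numba_extensions" group (with an entry point named "init") just as PyArrow
could. Here is a minimal sketch of how such registrations can be discovered
at runtime with the standard library. The package name "pyarrow_numba" and
the function "_init_extension" are hypothetical, and the helper below only
illustrates discovery; it is not Numba's actual loading code.

```python
from importlib.metadata import entry_points

# Hypothetical packaging metadata for an external "pyarrow_numba" package
# (pyproject.toml), shown here as a comment only:
#
#   [project.entry-points."numba_extensions"]
#   init = "pyarrow_numba:_init_extension"


def find_numba_extensions():
    """Return (name, target) pairs registered under "numba_extensions"."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points
        group = eps.select(group="numba_extensions")
    else:
        # Python 3.8/3.9: plain dict mapping group name to entry points
        group = eps.get("numba_extensions", [])
    return [(ep.name, ep.value) for ep in group]


# Lists whatever extensions happen to be installed in this environment.
print(find_numba_extensions())
```

Neither side imports the other at install time, so the "no hard dependency
on Numba" property holds for a separate package too; the ABI question is
the part that actually differs between the two options.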

Regards,
Vignesh

On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:

> Hi Antoine,
>
> This is exciting work. I am generally in favor of putting it inside PyArrow
> for the ease-of-use and ABI reasons above. Can you explain a bit more what
> the downsides are of putting it in PyArrow vs a separate package?
>
> Li
>
> On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]>
> wrote:
>
> >
> > Hello,
> >
> > Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python
> > that can speed up scientific calculations written in Python. Out
> > of the box, Numba supports Numpy arrays (which was the primary target
> > for its design).
> >
> > We (at QuantStack) have been investigating the feasibility of supporting
> > a subset of PyArrow in Numba, so that the fast computation abilities of
> > Numba can extend to data in the Arrow format.
> >
> > We have come to the conclusion that supporting a small subset of PyArrow
> > is definitely doable, at a competitive performance level (between "as
> > fast as C++" and "4x slower" on a couple preliminary micro-benchmarks).
> >
> > (by "small subset" we mostly mean: primitive data types, reading and
> > building arrays)
> >
> > The Numba integration layer would ideally have to be maintained and
> > distributed within PyArrow, because of the need to access a number of
> > Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
> > work around this by exporting a dedicated C-like ABI from PyArrow,
> > though).
> >
> > What we would like to know is how the community feels about putting this
> > code inside PyArrow, rather than a separate package, for the reason
> > given above.
> >
> > This would *not* add a dependency on Numba, since this can be exposed as
> > a dynamically-loaded extension point:
> > https://numba.readthedocs.io/en/stable/extending/entrypoints.html
> >
> > (note: this preliminary investigation was supported by one of our fine
> > customers)
> >
> > Regards
> >
> > Antoine.
> >
> >
>
