Hi Dewey,
Yes, I thought about this possibility. I think only nanoarrow would work,
as it provides options for private namespacing, thereby preventing ABI
issues. However, the main concern is what happens at the Python-Numba
boundary.
Specifically:
- how Python objects (such as PyArrow arrays) are unwrapped into native
arrays when a Numba function is entered
- how native arrays are wrapped as Python objects on the return path
Then we need to consider the different types of objects:
- arrays or record batches could go through the PyCapsule-based
protocol, but it needs to be implemented. Such an implementation already
exists in PyArrow and nanoarrow-python, which is an argument for basing
off of this existing work.
- chunked arrays or tables could similarly go through the C Stream
Interface, but that's a complicated (and slightly costly) indirection
- and that's not talking about ancillary types such as scalars, which we
probably want to support
A possible solution for all this would be to expose a small, private,
Numba-specific ABI (*) in PyArrow and implement the bulk of the
functionality as a separate project. For the earlier phases of
development, though, that would be more cumbersome, as we would need to
iterate on both projects in lockstep (and have the separate project
depend on unreleased versions of PyArrow).
(*) As a matter of fact, my current proof of concept relies on a small
set of C functions for which we can emit calls from Numba codegen (LLVM
doesn't know about C++ or C, only about the platform ABI):
https://github.com/apache/arrow/compare/main...pitrou:numba-exp#diff-a4b57ffdf0d6ab28e26bf1e4985b18669636f0ccb3af592c0dbd5789cb2ebef8
Regards
Antoine.
On 30/03/2026 at 04:05, Dewey Dunnington wrote:
Cool!
Just wondering: if you used arrow-rs, sparrow, or nanoarrow, could you
develop this as a separate project without adding another component to
PyArrow?
Cheers,
-dewey
On Fri, Mar 27, 2026 at 3:29 AM Raúl Cumplido <[email protected]> wrote:
Hi Vignesh
As for the release schedule concern, that argument doesn't hold up:
otherwise we would have to tie our releases to NumPy, Pandas or
others.
It is just business as usual to test against a set of versions in our
CI and keep our releases independent of any third party.
Obviously, any new feature carries a maintenance burden, but I am
unsure about the "potential dilution of pyarrow's core focus as a
universal columnar data layer": enabling better support for, and
integration with, the Python scientific computing ecosystem has always
been part of the project's scope.
And as Antoine mentioned, the integration needs C++ internals without
a stable ABI, which makes an external package fragile. That's, as far
as I understand it, the same reason our pandas/NumPy integration lives
in PyArrow.
Regards,
Raúl
On Fri, Mar 27, 2026 at 4:28 AM, Vignesh Siva
(<[email protected]>) wrote:
Thanks, Li Jin,
While integrating the Numba layer directly into PyArrow offers benefits
like potentially simpler user experience and direct access to C++ internals
without ABI concerns, there are several potential downsides from the
perspective of PyArrow's core development and project management. Firstly,
it would significantly increase the maintenance burden on the PyArrow
development team. This includes not only supporting the Numba integration
code itself but also ensuring its compatibility with future Numba and Arrow
releases and debugging issues specific to this integration. This could
divert resources from PyArrow's core mission and broader development.
Secondly, it could lead to an expansion of PyArrow's scope and a potential
dilution of its core focus as a universal columnar data layer. Adding
highly specialized integrations, even optional ones, can make the project
larger and more complex for new contributors to navigate. It also ties the
release cycles of Numba-specific features to PyArrow's release schedule,
which might not always align. An external package, while facing ABI
challenges, allows for more agile development, independent release cycles,
and a dedicated community focused solely on the Numba-PyArrow interface,
without adding overhead to the main PyArrow project.
Regards,
Vignesh
On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:
Hi Antoine,
This is exciting work. I am generally in favor of putting this inside
PyArrow, for the ease-of-use and ABI reasons above. Can you explain a
bit more what the downsides are of putting it in PyArrow vs. a separate
package?
Li
On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]>
wrote:
Hello,
Numba (https://numba.pydata.org/) is a Just-in-Time compiler that
speeds up scientific calculations written in Python. Out of the box,
Numba supports NumPy arrays (which were the primary target of its
design).
We (at QuantStack) have been investigating the feasibility of supporting
a subset of PyArrow in Numba, so that the fast computation abilities of
Numba can extend to data in the Arrow format.
We have come to the conclusion that supporting a small subset of PyArrow
is definitely doable, at a competitive performance level (between "as
fast as C++" and "4x slower" on a couple of preliminary
micro-benchmarks).
(by "small subset" we mostly mean: primitive data types, reading and
building arrays)
The Numba integration layer would ideally have to be maintained and
distributed within PyArrow, because of the need to access a number of
Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
work around this by exporting a dedicated C-like ABI from PyArrow,
though).
What we would like to know is how the community feels about putting this
code inside PyArrow, rather than in a separate package, for the reason
given above.
This would *not* add a dependency on Numba, since this can be exposed as
a dynamically-loaded extension point:
https://numba.readthedocs.io/en/stable/extending/entrypoints.html
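Concretely, such an extension point is only a packaging declaration; a
hypothetical sketch is below. The "pyarrow._numba" module and
"_init_extension" function are illustrative names, not actual PyArrow
APIs; only the "numba_extensions" group and "init" name are fixed by
Numba:

```python
# Numba scans the "numba_extensions" entry-point group and calls each
# "init" target lazily, on first compilation, so declaring the entry
# point does not make pyarrow import numba at startup.
# The module/function names below are hypothetical illustrations.
ENTRY_POINTS = {
    "numba_extensions": [
        "init = pyarrow._numba:_init_extension",
    ],
}
# This dict would be passed to setuptools:
#     setup(..., entry_points=ENTRY_POINTS)
```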
(note: this preliminary investigation was supported by one of our fine
customers)
Regards
Antoine.