Vignesh,
Can you stop posting AI-generated messages? This is bringing zero value
to the discussion.
Antoine.
On 31/03/2026 at 08:59, Vignesh Siva wrote:
Hi All,
Thank you for outlining the detailed considerations and the proposed path
forward regarding Numba integration with PyArrow. The challenge posed by
the unstable Arrow C++ ABI is certainly a significant hurdle.
The concept of exposing a private, Numba-specific ABI within PyArrow, while
offloading the more extensive Numba functionality to a separate project,
presents an interesting compromise. This approach seems to address the
immediate need for performance and direct API access without necessarily
burdening PyArrow with the full scope of Numba development. However, I
share the concerns about the potential for increased maintenance overhead
within PyArrow and the initial cumbersome development workflow you
mentioned.
I believe this solution warrants careful consideration, particularly in
terms of its long-term implications for PyArrow's architecture and the
maintainability of the exposed ABI. It would be valuable to gather further
community feedback on whether this internal exposure of a Numba-specific
ABI aligns with the project's broader strategy and how we envision managing
potential versioning and stability issues. What are your thoughts on
mitigating the initial development complexities and ensuring clear
boundaries between the core PyArrow and the Numba integration layer?
Regards,
Vignesh
On Mon, 30 Mar 2026 at 20:17, Dewey Dunnington <[email protected]> wrote:
Hi Antoine,
Perhaps this could be done by a python-level dependency? (e.g., wrap
input with pyarrow.array(); wrap output in pyarrow.array() to keep
using pyarrow objects for input/output). I would be surprised if the
overhead of ChunkedArray -> ArrowArrayStream is meaningful here but
I'm also not plugged into the details so feel free to ignore me :).
This is also only overhead for pyarrow input (polars/arro3/other
non-pyarrow input would have to do this anyway).
(I'm not opposed to adding this to pyarrow...it's just sometimes
floated that it's tricky to attract pyarrow maintainers and this
particular bit of extra code to maintain/keep up-to-date with another
third-party library seems avoidable)
Cheers,
-dewey
On Mon, Mar 30, 2026 at 3:28 AM Antoine Pitrou <[email protected]> wrote:
Hi Dewey,
Yes, I thought about this possibility. I think only nanoarrow would work,
as it provides options for private namespacing and therefore prevents
ABI issues. However, the main concern is what happens at the Python/Numba
boundary.
Specifically:
- how are Python objects (such as PyArrow arrays) unwrapped into native
arrays, when a Numba function is entered
- how are native arrays wrapped as Python objects on the return path
Then we need to consider the different types of objects:
- arrays or record batches could go through the PyCapsule-based
protocol, but it needs to be implemented. Such an implementation already
exists in PyArrow and nanoarrow-python, which is an argument for basing
off of this existing work.
- chunked arrays or tables could similarly go through the C Stream
Interface, but that's a complicated (and slightly costly) indirection
- and that's not talking about ancillary types such as scalars, which we
probably want to support
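The distinction between these kinds of objects can be sketched in plain Python. This is a minimal illustration, assuming the boundary code dispatches on the dunder methods defined by the Arrow PyCapsule interface (`__arrow_c_array__` and `__arrow_c_stream__`). The function `classify_arrow_input` and the `Fake*` classes are hypothetical names, not part of PyArrow, nanoarrow, or Numba.

```python
# Sketch only: decide how an input object could be unwrapped at the
# Python/Numba boundary, based on the Arrow PyCapsule dunder methods it
# exposes. All names defined here are hypothetical.

def classify_arrow_input(obj):
    """Return which Arrow C interface the object could go through."""
    if hasattr(obj, "__arrow_c_array__"):
        # Arrays and record batches: C Data Interface, a single chunk.
        return "array"
    if hasattr(obj, "__arrow_c_stream__"):
        # Chunked arrays and tables: C Stream Interface, a sequence of chunks.
        return "stream"
    # Anything else (for example scalars) would need separate handling.
    return "unsupported"


class FakeArray:
    """Stand-in for an object exporting a single Arrow array."""

    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


class FakeChunkedArray:
    """Stand-in for an object exporting a stream of Arrow arrays."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


kinds = [
    classify_arrow_input(FakeArray()),
    classify_arrow_input(FakeChunkedArray()),
    classify_arrow_input(3.14),
]
```

The array check comes first because some objects may export both protocols, and the single-chunk path is the cheaper one.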
A possible solution for all this would be to expose a small, private,
Numba-specific ABI (*) in PyArrow and implement the bulk of the
functionality as a separate project. For the earlier phases of
development, though, that would make development more cumbersome as we
would need to iterate on both projects in lockstep (and have the
separate project depend on unreleased versions of PyArrow).
(*) As a matter of fact, my current proof of concept relies on a small
set of C functions for which we can emit calls from Numba codegen (LLVM
doesn't know about C++ or C, only about the platform ABI):
https://github.com/apache/arrow/compare/main...pitrou:numba-exp#diff-a4b57ffdf0d6ab28e26bf1e4985b18669636f0ccb3af592c0dbd5789cb2ebef8
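To illustrate why a small set of C functions is enough, here is a minimal sketch using `ctypes` as a stand-in for JIT-generated code: the caller only needs a raw address and a C signature, nothing about the language the function was written in. The accessor `get_value` is hypothetical and is not one of the functions in the proof of concept linked above.

```python
import ctypes

# C-level signature: int64_t get_value(const int64_t* values, int64_t index)
# `get_value` is a hypothetical accessor used for illustration only.
GET_VALUE_PROTO = ctypes.CFUNCTYPE(
    ctypes.c_int64, ctypes.POINTER(ctypes.c_int64), ctypes.c_int64
)


@GET_VALUE_PROTO
def get_value(values, index):
    # Read one element from a raw int64 buffer.
    return values[index]


# A code generator only needs the raw address and the signature.
address = ctypes.cast(get_value, ctypes.c_void_p).value

# Rebuild a callable from the bare address, as a caller with no knowledge
# of the original definition would.
call_by_address = GET_VALUE_PROTO(address)

buffer = (ctypes.c_int64 * 3)(10, 20, 30)
result = call_by_address(buffer, 2)
```

In the real integration the call would be emitted as LLVM IR by Numba rather than made through `ctypes`, but the contract is the same: a symbol address plus a platform-ABI signature.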
Regards
Antoine.
On 30/03/2026 at 04:05, Dewey Dunnington wrote:
Cool!
Just wondering: if you used arrow-rs, sparrow, or nanoarrow, could you
develop this as a separate project without adding another component to
PyArrow?
Cheers,
-dewey
On Fri, Mar 27, 2026 at 3:29 AM Raúl Cumplido <[email protected]> wrote:
Hi Vignesh
As for the release schedule concerns, that argument doesn't hold up.
Otherwise we would have to tie our releases to NumPy, pandas or
others.
It is just business as usual to test against a set of versions on our
CI and keep our releases independent from any third party.
Obviously any new feature to the project has a maintenance burden
associated with it but I am unsure about the "potential dilution of
pyarrow's core focus as a universal columnar data layer". Enabling
better support and integrations with the Python scientific computing
ecosystem has been part of the scope of the project.
And as Antoine mentioned, the integration needs C++ internals without
a stable ABI, which makes an external package fragile. That's, as far
as I understand it, the same reason our pandas/NumPy integration lives
in PyArrow.
Regards,
Raúl
On Fri, 27 Mar 2026 at 4:28, Vignesh Siva (<[email protected]>) wrote:
Thanks, Li Jin,
While integrating the Numba layer directly into PyArrow offers benefits
like potentially simpler user experience and direct access to C++
internals without ABI concerns, there are several potential downsides
from the perspective of PyArrow's core development and project
management. Firstly, it would significantly increase the maintenance
burden on the PyArrow development team. This includes not only supporting
the Numba integration code itself but also ensuring its compatibility
with future Numba and Arrow releases and debugging issues specific to
this integration. This could divert resources from PyArrow's core mission
and broader development.
Secondly, it could lead to an expansion of PyArrow's scope and a
potential dilution of its core focus as a universal columnar data layer.
Adding highly specialized integrations, even optional ones, can make the
project larger and more complex for new contributors to navigate. It also
ties the release cycles of Numba-specific features to PyArrow's release
schedule, which might not always align. An external package, while facing
ABI challenges, allows for more agile development, independent release
cycles, and a dedicated community focused solely on the Numba-PyArrow
interface, without adding overhead to the main PyArrow project.
Regards,
Vignesh
On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:
Hi Antoine,
This is exciting work. I am generally in favor of putting it inside
PyArrow for ease of use and the ABI reasons above. Can you explain a bit
more what the downsides are of putting it in PyArrow vs a separate
package?
Li
On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]> wrote:
Hello,
Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python
that makes it possible to speed up scientific calculations written in
Python. Out of the box, Numba supports NumPy arrays (which were the
primary target for its design).
We (at QuantStack) have been investigating the feasibility of supporting
a subset of PyArrow in Numba, so that the fast computation abilities of
Numba can extend to data in the Arrow format.
We have come to the conclusion that supporting a small subset of PyArrow
is definitely doable, at a competitive performance level (between "as
fast as C++" and "4x slower" on a couple of preliminary
micro-benchmarks).
(by "small subset" we mostly mean: primitive data types, reading and
building arrays)
The Numba integration layer would ideally have to be maintained and
distributed within PyArrow, because of the need to access a number of
Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
work around this by exporting a dedicated C-like ABI from PyArrow,
though).
What we would like to know is how the community feels about putting this
code inside PyArrow, rather than a separate package, for the reason
given above.
This would *not* add a dependency on Numba, since this can be exposed as
a dynamically-loaded extension point:
https://numba.readthedocs.io/en/stable/extending/entrypoints.html
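As a rough sketch of how such an extension point avoids a hard dependency: per the Numba documentation linked above, a package declares an `init` entry in the `numba_extensions` entry-point group, and Numba calls the referenced function itself when needed. The package and module names below (`some_package._numba_init`, `init_extension`) are hypothetical, and the registration step is left as a comment because the actual PyArrow code is not shown in this thread.

```python
# Hypothetical module some_package/_numba_init.py (name assumed).
#
# The package would declare the entry point in its packaging metadata,
# for example in pyproject.toml:
#
#   [project.entry-points.numba_extensions]
#   init = "some_package._numba_init:init_extension"
#
# Nothing in this module imports Numba at import time, so installing or
# importing the package does not require Numba to be present.

_initialized = False


def init_extension():
    """Entry-point target; returns True only on the first call."""
    global _initialized
    if _initialized:
        return False
    _initialized = True
    # Real code would import numba here and register its type and
    # lowering definitions. That part is omitted from this sketch.
    return True


first_call = init_extension()
second_call = init_extension()
```

The guard makes the function safe to call more than once, which is a defensive choice of this sketch rather than a documented requirement.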
(note: this preliminary investigation was supported by one of our fine
customers)
Regards
Antoine.