Hi Antoine,

Perhaps this could be done via a Python-level dependency? (e.g., wrap input with pyarrow.array(); wrap output in pyarrow.array() to keep using pyarrow objects for input/output). I would be surprised if the overhead of ChunkedArray -> ArrowArrayStream is meaningful here, but I'm also not plugged into the details, so feel free to ignore me :). This is also only overhead for pyarrow input (polars/arro3/other non-pyarrow input would have to do this anyway).
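[Editor's note: Dewey's "wrap at the boundary" idea could be sketched roughly as below. This is a hypothetical, dependency-free illustration: `coerce` stands in for pyarrow.array(), which consumes any object exposing the Arrow PyCapsule protocol (`__arrow_c_array__`); `arrow_boundary` and the kernel names are made up for the sketch.]

```python
# Hypothetical sketch: coerce any Arrow-protocol argument before it reaches
# a compiled kernel, so polars/arro3/pyarrow inputs all arrive in one form.
# In real code `coerce` would be pyarrow.array(), which consumes objects
# implementing __arrow_c_array__; here it is injected so the sketch runs
# without any third-party dependency.
import functools

def arrow_boundary(coerce):
    """Decorator factory: run `coerce` over every argument that exposes
    the Arrow C Data Interface via the PyCapsule protocol."""
    def decorate(kernel):
        @functools.wraps(kernel)
        def wrapper(*args):
            return kernel(*(
                coerce(a) if hasattr(a, "__arrow_c_array__") else a
                for a in args
            ))
        return wrapper
    return decorate

# Demo with a stand-in producer; a real __arrow_c_array__ returns a pair
# of PyCapsules (ArrowSchema, ArrowArray).
class FakeArrowArray:
    def __arrow_c_array__(self, requested_schema=None):
        raise NotImplementedError

@arrow_boundary(lambda a: "coerced")
def kernel(x, y):
    return (x, y)

print(kernel(FakeArrowArray(), 42))  # only the Arrow-like argument is coerced
```

The point of the sketch is that the dispatch is pure duck typing on the protocol method, so no import of (or dependency on) any particular Arrow implementation is needed at the boundary.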
(I'm not opposed to adding this to pyarrow... it's just sometimes floated that it's tricky to attract pyarrow maintainers, and this particular bit of extra code to maintain and keep up to date with another third-party library seems avoidable.)

Cheers,

-dewey

On Mon, Mar 30, 2026 at 3:28 AM Antoine Pitrou <[email protected]> wrote:
>
> Hi Dewey,
>
> Yes, I thought about this possibility. I think only nanoarrow would work
> as it provides options for private namespacing and therefore prevents
> ABI issues. However, the main concern is what happens at the Python -
> Numba boundary.
>
> Specifically:
> - how are Python objects (such as PyArrow arrays) unwrapped into native
> arrays when a Numba function is entered
> - how are native arrays wrapped as Python objects on the return path
>
> Then we need to consider the different types of objects:
>
> - arrays or record batches could go through the PyCapsule-based
> protocol, but it needs to be implemented. Such an implementation already
> exists in PyArrow and nanoarrow-python, which is an argument for basing
> this off of the existing work.
>
> - chunked arrays or tables could similarly go through the C Stream
> Interface, but that's a complicated (and slightly costly) indirection
>
> - and that's not counting ancillary types such as scalars, which we
> probably want to support
>
> A possible solution to all this would be to expose a small, private,
> Numba-specific ABI (*) in PyArrow and implement the bulk of the
> functionality as a separate project. For the earlier phases of
> development, though, that would make development more cumbersome, as we
> would need to iterate on both projects in lockstep (and have the
> separate project depend on unreleased versions of PyArrow).
>
> (*) As a matter of fact, my current proof of concept relies on a small
> set of C functions for which we can emit calls from Numba codegen (LLVM
> doesn't know about C++ or C, only about the platform ABI):
>
> https://github.com/apache/arrow/compare/main...pitrou:numba-exp#diff-a4b57ffdf0d6ab28e26bf1e4985b18669636f0ccb3af592c0dbd5789cb2ebef8
>
> Regards
>
> Antoine.
>
>
> On 30/03/2026 at 04:05, Dewey Dunnington wrote:
> > Cool!
> >
> > Just wondering: if you used arrow-rs, sparrow, or nanoarrow, could you
> > develop this as a separate project without adding another component to
> > PyArrow?
> >
> > Cheers,
> >
> > -dewey
> >
> > On Fri, Mar 27, 2026 at 3:29 AM Raúl Cumplido <[email protected]> wrote:
> >>
> >> Hi Vignesh
> >>
> >> As for the release-schedule concerns, that argument doesn't hold up.
> >> Otherwise we would have to tie our releases to NumPy, pandas, or
> >> others.
> >>
> >> It is just business as usual to test against a set of versions on our
> >> CI and keep our releases independent from any third party.
> >>
> >> Obviously any new feature to the project has a maintenance burden
> >> associated with it, but I am unsure about the "potential dilution of
> >> pyarrow's core focus as a universal columnar data layer". Enabling
> >> better support and integrations with the Python scientific computing
> >> ecosystem has been part of the scope of the project.
> >>
> >> And as Antoine mentioned, the integration needs C++ internals without
> >> a stable ABI, which makes an external package fragile. That's, as far
> >> as I understand it, the same reason our pandas/NumPy integration lives
> >> in PyArrow.
> >>
> >> Regards,
> >> Raúl
> >>
> >> On Fri, 27 Mar 2026 at 4:28, Vignesh Siva (<[email protected]>) wrote:
> >>>
> >>> Thanks, Li Jin,
> >>>
> >>> While integrating the Numba layer directly into PyArrow offers benefits
> >>> like a potentially simpler user experience and direct access to C++
> >>> internals without ABI concerns, there are several potential downsides
> >>> from the perspective of PyArrow's core development and project
> >>> management. Firstly, it would significantly increase the maintenance
> >>> burden on the PyArrow development team. This includes not only
> >>> supporting the Numba integration code itself but also ensuring its
> >>> compatibility with future Numba and Arrow releases and debugging issues
> >>> specific to this integration. This could divert resources from
> >>> PyArrow's core mission and broader development.
> >>>
> >>> Secondly, it could lead to an expansion of PyArrow's scope and a
> >>> potential dilution of its core focus as a universal columnar data
> >>> layer. Adding highly specialized integrations, even optional ones, can
> >>> make the project larger and more complex for new contributors to
> >>> navigate. It also ties the release cycles of Numba-specific features to
> >>> PyArrow's release schedule, which might not always align. An external
> >>> package, while facing ABI challenges, allows for more agile
> >>> development, independent release cycles, and a dedicated community
> >>> focused solely on the Numba-PyArrow interface, without adding overhead
> >>> to the main PyArrow project.
> >>>
> >>> Regards,
> >>> Vignesh
> >>>
> >>> On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:
> >>>
> >>>> Hi Antoine,
> >>>>
> >>>> This is exciting work. I am generally in favor of putting this inside
> >>>> PyArrow for the ease-of-use and ABI reasons above. Can you explain a
> >>>> bit more what the downsides are of putting it in PyArrow vs a separate
> >>>> package?
> >>>>
> >>>> Li
> >>>>
> >>>> On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]> wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python
> >>>>> that allows speeding up scientific calculations written in Python. Out
> >>>>> of the box, Numba supports NumPy arrays (which were the primary target
> >>>>> for its design).
> >>>>>
> >>>>> We (at QuantStack) have been investigating the feasibility of supporting
> >>>>> a subset of PyArrow in Numba, so that the fast computation abilities of
> >>>>> Numba can extend to data in the Arrow format.
> >>>>>
> >>>>> We have come to the conclusion that supporting a small subset of PyArrow
> >>>>> is definitely doable, at a competitive performance level (between "as
> >>>>> fast as C++" and "4x slower" on a couple of preliminary micro-benchmarks).
> >>>>>
> >>>>> (by "small subset" we mostly mean: primitive data types, reading and
> >>>>> building arrays)
> >>>>>
> >>>>> The Numba integration layer would ideally have to be maintained and
> >>>>> distributed within PyArrow, because of the need to access a number of
> >>>>> Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
> >>>>> work around this by exporting a dedicated C-like ABI from PyArrow,
> >>>>> though).
> >>>>>
> >>>>> What we would like to know is how the community feels about putting this
> >>>>> code inside PyArrow, rather than in a separate package, for the reason
> >>>>> given above.
> >>>>>
> >>>>> This would *not* add a dependency on Numba, since this can be exposed as
> >>>>> a dynamically-loaded extension point:
> >>>>> https://numba.readthedocs.io/en/stable/extending/entrypoints.html
> >>>>>
> >>>>> (note: this preliminary investigation was supported by one of our fine
> >>>>> customers)
> >>>>>
> >>>>> Regards
> >>>>>
> >>>>> Antoine.
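[Editor's note: the entry-point mechanism Antoine links means the hosting package only needs a piece of packaging metadata for Numba to discover and load the integration lazily. A hedged sketch of what that metadata could look like; the package and module names below are hypothetical, only the `numba_extensions` group and `init` entry-point name come from the Numba documentation:]

```toml
# Hypothetical pyproject.toml fragment for whichever package ships the
# integration. Numba scans the "numba_extensions" entry-point group at
# import time and calls the function registered under "init", so neither
# side gains a hard dependency on the other.
[project.entry-points.numba_extensions]
init = "pyarrow_numba._extension:init_numba"
```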
