Hello,

I don't think I am the right person to respond, as I am generally overoptimistic and uncritical of any new idea =) That being said, adding basic Arrow array support in Numba feels like a very interesting thing to do, and I would not be against adding the code to PyArrow (mainly because I do not think it would be abandoned, since it is a suggestion from you, Antoine).
Might Numba support attract interest from the community and help us find and fund more active PyArrow maintainers?

All best,
Alenka

On Tue, 31 Mar 2026 at 09:02, Antoine Pitrou <[email protected]> wrote:
>
> Vignesh,
>
> Can you stop posting AI-generated messages? This is bringing zero value
> to the discussion.
>
> Antoine.
>
>
> On 31/03/2026 at 08:59, Vignesh Siva wrote:
> > Hi All,
> >
> > Thank you for outlining the detailed considerations and the proposed path
> > forward regarding Numba integration with PyArrow. The challenge posed by
> > the unstable Arrow C++ ABI is certainly a significant hurdle.
> >
> > The concept of exposing a private, Numba-specific ABI within PyArrow, while
> > offloading the more extensive Numba functionality to a separate project,
> > presents an interesting compromise. This approach seems to address the
> > immediate need for performance and direct API access without necessarily
> > burdening PyArrow with the full scope of Numba development. However, I
> > share the concerns about the potential for increased maintenance overhead
> > within PyArrow and the initial cumbersome development workflow you
> > mentioned.
> >
> > I believe this solution warrants careful consideration, particularly in
> > terms of its long-term implications for PyArrow's architecture and the
> > maintainability of the exposed ABI. It would be valuable to gather further
> > community feedback on whether this internal exposure of a Numba-specific
> > ABI aligns with the project's broader strategy and how we envision managing
> > potential versioning and stability issues. What are your thoughts on
> > mitigating the initial development complexities and ensuring clear
> > boundaries between the core PyArrow and the Numba integration layer?
> >
> > Regards,
> > Vignesh
> >
> > On Mon, 30 Mar 2026 at 20:17, Dewey Dunnington <[email protected]>
> > wrote:
> >
> >> Hi Antoine,
> >>
> >> Perhaps this could be done by a python-level dependency? (e.g., wrap
> >> input with pyarrow.array(); wrap output in pyarrow.array() to keep
> >> using pyarrow objects for input/output). I would be surprised if the
> >> overhead of ChunkedArray -> ArrowArrayStream is meaningful here but
> >> I'm also not plugged into the details so feel free to ignore me :).
> >> This is also only overhead for pyarrow input (polars/arro3/other
> >> non-pyarrow input would have to do this anyway).
> >>
> >> (I'm not opposed to adding this to pyarrow...it's just sometimes
> >> floated that it's tricky to attract pyarrow maintainers and this
> >> particular bit of extra code to maintain/keep up-to-date with another
> >> third-party library seems avoidable)
> >>
> >> Cheers,
> >>
> >> -dewey
> >>
> >> On Mon, Mar 30, 2026 at 3:28 AM Antoine Pitrou <[email protected]> wrote:
> >>>
> >>>
> >>> Hi Dewey,
> >>>
> >>> Yes, I thought about this possibility. I think only nanoarrow would work
> >>> as it provides options for private namespacing and therefore preventing
> >>> ABI issues. However, the main concern is what happens at the Python -
> >>> Numba boundary.
> >>>
> >>> Specifically:
> >>> - how are Python objects (such as PyArrow arrays) unwrapped into native
> >>> arrays, when a Numba function is entered
> >>> - how are native arrays wrapped as Python objects at the return path
> >>>
> >>> Then we need to consider the different types of objects:
> >>>
> >>> - arrays or record batches could go through the PyCapsule-based
> >>> protocol, but it needs to be implemented. Such an implementation already
> >>> exists in PyArrow and nanoarrow-python, which is an argument for basing
> >>> off of this existing work.
> >>>
> >>> - chunked arrays or tables could similarly go through the C Stream
> >>> Interface, but that's a complicated (and slightly costly) indirection
> >>>
> >>> - and that's not talking about ancillary types such as scalars, which we
> >>> probably want to support
> >>>
> >>> A possible solution for all this would be to expose a small, private,
> >>> Numba-specific ABI (*) in PyArrow and implement the bulk of the
> >>> functionality as a separate project. For the earlier phases of
> >>> development, though, that would make development more cumbersome as we
> >>> would need to iterate on both projects in lockstep (and have the
> >>> separate project depend on unreleased versions of PyArrow).
> >>>
> >>>
> >>> (*) As a matter of fact, my current proof of concept relies on a small
> >>> set of C functions for which we can emit calls from Numba codegen (LLVM
> >>> doesn't know about C++ or C, only about the platform ABI):
> >>>
> >>> https://github.com/apache/arrow/compare/main...pitrou:numba-exp#diff-a4b57ffdf0d6ab28e26bf1e4985b18669636f0ccb3af592c0dbd5789cb2ebef8
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>>
> >>> On 30/03/2026 at 04:05, Dewey Dunnington wrote:
> >>>> Cool!
> >>>>
> >>>> Just wondering: if you used arrow-rs, sparrow, or nanoarrow, could you
> >>>> develop this as a separate project without adding another component to
> >>>> PyArrow?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> -dewey
> >>>>
> >>>> On Fri, Mar 27, 2026 at 3:29 AM Raúl Cumplido <[email protected]> wrote:
> >>>>>
> >>>>> Hi Vignesh
> >>>>>
> >>>>> As for the release schedule concerns, that argument doesn't hold up.
> >>>>> Otherwise we would have to tie our releases to Numpy, Pandas or
> >>>>> others.
> >>>>>
> >>>>> It is just business as usual to test against a set of versions on our
> >>>>> CI and keep our releases independent from any third party.
> >>>>>
> >>>>> Obviously any new feature to the project has a maintenance burden
> >>>>> associated with it but I am unsure about the "potential dilution of
> >>>>> pyarrow's core focus as a universal columnar data layer". Enabling
> >>>>> better support and integrations with the Python scientific computing
> >>>>> ecosystem has been part of the scope of the project.
> >>>>>
> >>>>> And as Antoine mentioned, the integration needs C++ internals without
> >>>>> a stable ABI, which makes an external package fragile. That's, as far
> >>>>> as I understand it, the same reason our pandas/NumPy integration lives
> >>>>> in PyArrow.
> >>>>>
> >>>>> Regards,
> >>>>> Raúl
> >>>>>
> >>>>> On Fri, 27 Mar 2026 at 4:28, Vignesh Siva
> >>>>> (<[email protected]>) wrote:
> >>>>>>
> >>>>>> Thanks, Li Jin,
> >>>>>>
> >>>>>> While integrating the Numba layer directly into PyArrow offers benefits
> >>>>>> like potentially simpler user experience and direct access to C++
> >>>>>> internals without ABI concerns, there are several potential downsides
> >>>>>> from the perspective of PyArrow's core development and project
> >>>>>> management. Firstly, it would significantly increase the maintenance
> >>>>>> burden on the PyArrow development team. This includes not only
> >>>>>> supporting the Numba integration code itself but also ensuring its
> >>>>>> compatibility with future Numba and Arrow releases and debugging issues
> >>>>>> specific to this integration. This could divert resources from
> >>>>>> PyArrow's core mission and broader development.
> >>>>>>
> >>>>>> Secondly, it could lead to an expansion of PyArrow's scope and a
> >>>>>> potential dilution of its core focus as a universal columnar data
> >>>>>> layer. Adding highly specialized integrations, even optional ones, can
> >>>>>> make the project larger and more complex for new contributors to
> >>>>>> navigate.
> >>>>>> It also ties the release cycles of Numba-specific features to PyArrow's
> >>>>>> release schedule, which might not always align. An external package,
> >>>>>> while facing ABI challenges, allows for more agile development,
> >>>>>> independent release cycles, and a dedicated community focused solely on
> >>>>>> the Numba-PyArrow interface, without adding overhead to the main
> >>>>>> PyArrow project.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Vignesh
> >>>>>>
> >>>>>> On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi Antoine,
> >>>>>>>
> >>>>>>> This is exciting work. I am generally in favor of putting it inside
> >>>>>>> PyArrow for ease of use and the ABI reasons above. Can you explain a
> >>>>>>> bit more what the downsides are of putting it in PyArrow vs a separate
> >>>>>>> package?
> >>>>>>>
> >>>>>>> Li
> >>>>>>>
> >>>>>>> On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python
> >>>>>>>> that allows speeding up scientific calculations written in Python. Out
> >>>>>>>> of the box, Numba supports Numpy arrays (which was the primary target
> >>>>>>>> for its design).
> >>>>>>>>
> >>>>>>>> We (at QuantStack) have been investigating the feasibility of supporting
> >>>>>>>> a subset of PyArrow in Numba, so that the fast computation abilities of
> >>>>>>>> Numba can extend to data in the Arrow format.
> >>>>>>>>
> >>>>>>>> We have come to the conclusion that supporting a small subset of PyArrow
> >>>>>>>> is definitely doable, at a competitive performance level (between "as
> >>>>>>>> fast as C++" and "4x slower" on a couple of preliminary
> >>>>>>>> micro-benchmarks).
> >>>>>>>>
> >>>>>>>> (by "small subset" we mostly mean: primitive data types, reading and
> >>>>>>>> building arrays)
> >>>>>>>>
> >>>>>>>> The Numba integration layer would ideally have to be maintained and
> >>>>>>>> distributed within PyArrow, because of the need to access a number of
> >>>>>>>> Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
> >>>>>>>> work around this by exporting a dedicated C-like ABI from PyArrow,
> >>>>>>>> though).
> >>>>>>>>
> >>>>>>>> What we would like to know is how the community feels about putting this
> >>>>>>>> code inside PyArrow, rather than a separate package, for the reason
> >>>>>>>> given above.
> >>>>>>>>
> >>>>>>>> This would *not* add a dependency on Numba, since this can be exposed as
> >>>>>>>> a dynamically-loaded extension point:
> >>>>>>>> https://numba.readthedocs.io/en/stable/extending/entrypoints.html
> >>>>>>>>
> >>>>>>>> (note: this preliminary investigation was supported by one of our fine
> >>>>>>>> customers)
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Antoine.
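
As a rough illustration of what "primitive data types, reading and building arrays" means at the buffer level, here is a minimal pure-Python sketch. It is not taken from the proof of concept linked in the thread. It assumes the standard Arrow layout for a primitive array (a little-endian values buffer plus an optional, LSB-ordered validity bitmap), ignores array offsets, and uses neither PyArrow nor Numba. The function names is_valid and sum_int64 and the sample data are hypothetical.

```python
import struct
from typing import Optional, Tuple


def is_valid(validity: Optional[bytes], i: int) -> bool:
    """Check slot i in an Arrow validity bitmap (bit i is in byte i // 8, LSB first)."""
    if validity is None:
        # Arrow allows omitting the bitmap when the array has no nulls.
        return True
    return bool((validity[i // 8] >> (i % 8)) & 1)


def sum_int64(values: bytes, validity: Optional[bytes], length: int) -> Tuple[int, int]:
    """Return (sum of valid values, null count) for an int64 array given raw buffers."""
    total = 0
    nulls = 0
    for i in range(length):
        if is_valid(validity, i):
            # Each int64 slot occupies 8 little-endian bytes in the values buffer.
            (value,) = struct.unpack_from("<q", values, i * 8)
            total += value
        else:
            nulls += 1
    return total, nulls


# Hypothetical array [1, null, 3, 4]: bits 0, 2 and 3 are set in the bitmap.
values = struct.pack("<4q", 1, 0, 3, 4)
validity = bytes([0b00001101])
result = sum_int64(values, validity, 4)
```

A compiled kernel would presumably run the same kind of loop directly over the native buffers instead of going through the Python interpreter for every element.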
