Cool! Just wondering: if you used arrow-rs, sparrow, or nanoarrow, could you develop this as a separate project without adding another component to PyArrow?

Cheers,

-dewey

On Fri, Mar 27, 2026 at 3:29 AM Raúl Cumplido <[email protected]> wrote:
>
> Hi Vignesh
>
> As per the release schedule concerns that argument doesn't hold up.
> Otherwise we would have to tie our releases to NumPy, Pandas or
> others.
>
> It is just business as usual to test against a set of versions on our
> CI and keep our releases independent from any third party.
>
> Obviously any new feature to the project has a maintenance burden
> associated with it but I am unsure about the "potential dilution of
> pyarrow's core focus as a universal columnar data layer". Enabling
> better support and integrations with the Python scientific computing
> ecosystem has been part of the scope of the project.
>
> And as Antoine mentioned, the integration needs C++ internals without
> a stable ABI, which makes an external package fragile. That's, as far
> as I understand it, the same reason our pandas/NumPy integration lives
> in PyArrow.
>
> Regards,
> Raúl
>
> On Fri, 27 Mar 2026 at 4:28, Vignesh Siva
> (<[email protected]>) wrote:
> >
> > Thanks, Li Jin,
> >
> > While integrating the Numba layer directly into PyArrow offers benefits
> > like potentially simpler user experience and direct access to C++ internals
> > without ABI concerns, there are several potential downsides from the
> > perspective of PyArrow's core development and project management. Firstly,
> > it would significantly increase the maintenance burden on the PyArrow
> > development team. This includes not only supporting the Numba integration
> > code itself but also ensuring its compatibility with future Numba and Arrow
> > releases and debugging issues specific to this integration. This could
> > divert resources from PyArrow's core mission and broader development.
> >
> > Secondly, it could lead to an expansion of PyArrow's scope and a potential
> > dilution of its core focus as a universal columnar data layer. Adding
> > highly specialized integrations, even optional ones, can make the project
> > larger and more complex for new contributors to navigate. It also ties the
> > release cycles of Numba-specific features to PyArrow's release schedule,
> > which might not always align. An external package, while facing ABI
> > challenges, allows for more agile development, independent release cycles,
> > and a dedicated community focused solely on the Numba-PyArrow interface,
> > without adding overhead to the main PyArrow project.
> >
> > Regards,
> > Vignesh
> >
> > On Fri, 27 Mar 2026 at 05:51, Li Jin <[email protected]> wrote:
> >
> > > Hi Antoine,
> > >
> > > This is exciting work. I am generally in favor of putting it inside PyArrow
> > > for ease of use and the ABI reasons above. Can you explain a bit more what
> > > the downsides are of putting it in PyArrow vs a separate package?
> > >
> > > Li
> > >
> > > On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python
> > > > that makes it possible to speed up scientific calculations written in
> > > > Python. Out of the box, Numba supports NumPy arrays (which was the
> > > > primary target for its design).
> > > >
> > > > We (at QuantStack) have been investigating the feasibility of supporting
> > > > a subset of PyArrow in Numba, so that the fast computation abilities of
> > > > Numba can extend to data in the Arrow format.
> > > >
> > > > We have come to the conclusion that supporting a small subset of PyArrow
> > > > is definitely doable, at a competitive performance level (between "as
> > > > fast as C++" and "4x slower" on a couple preliminary micro-benchmarks).
> > > >
> > > > (by "small subset" we mostly mean: primitive data types, reading and
> > > > building arrays)
> > > >
> > > > The Numba integration layer would ideally have to be maintained and
> > > > distributed within PyArrow, because of the need to access a number of
> > > > Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
> > > > work around this by exporting a dedicated C-like ABI from PyArrow,
> > > > though).
> > > >
> > > > What we would like to know is how the community feels about putting this
> > > > code inside PyArrow, rather than a separate package, for the reason
> > > > given above.
> > > >
> > > > This would *not* add a dependency on Numba, since this can be exposed as
> > > > a dynamically-loaded extension point:
> > > > https://numba.readthedocs.io/en/stable/extending/entrypoints.html
> > > >
> > > > (note: this preliminary investigation was supported by one of our fine
> > > > customers)
> > > >
> > > > Regards
> > > >
> > > > Antoine.
