Hi Antoine,

This is exciting work. I am generally in favor of putting this inside PyArrow, for ease of use and for the ABI reasons you mention. Can you explain a bit more what the downsides would be of putting it in PyArrow vs. a separate package?
Li

On Thu, Mar 26, 2026 at 11:08 AM Antoine Pitrou <[email protected]> wrote:
>
> Hello,
>
> Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python
> that allows to speed up scientific calculations written in Python. Out
> of the box, Numba supports Numpy arrays (which was the primary target
> for its design).
>
> We (at QuantStack) have been investigating the feasibility of supporting
> a subset of PyArrow in Numba, so that the fast computation abilities of
> Numba can extend to data in the Arrow format.
>
> We have come to the conclusion that supporting a small subset of PyArrow
> is definitely doable, at a competitive performance level (between "as
> fast as C++" and "4x slower" on a couple preliminary micro-benchmarks).
>
> (by "small subset" we mostly mean: primitive data types, reading and
> building arrays)
>
> The Numba integration layer would ideally have to be maintained and
> distributed within PyArrow, because of the need to access a number of
> Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to
> work around this by exporting a dedicated C-like ABI from PyArrow, though).
>
> What we would like to know is how the community feels about putting this
> code inside PyArrow, rather than a separate package, for the reason
> given above.
>
> This would *not* add a dependency on Numba, since this can be exposed as
> a dynamically-loaded extension point:
> https://numba.readthedocs.io/en/stable/extending/entrypoints.html
>
> (note: this preliminary investigation was supported by one of our fine
> customers)
>
> Regards
>
> Antoine.
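[For readers following the thread: the "dynamically-loaded extension point" Antoine links to is Numba's setuptools entry-point mechanism. A package declares an entry point in the "numba_extensions" group, and Numba only loads it lazily, the first time it compiles something, so importing the package never pulls in Numba. A minimal sketch of what this could look like on the PyArrow side is below; the module path "pyarrow._numba_ext" and the "init" function are hypothetical placeholders, not actual PyArrow names.]

```python
# Hypothetical sketch of exposing a Numba extension from PyArrow's setup.py
# without adding a Numba dependency. Numba discovers extensions registered
# under the "numba_extensions" entry-point group (see the Numba docs linked
# above) and calls the named function lazily, on first compilation.
from setuptools import setup

setup(
    name="pyarrow",
    # ... other distribution metadata elided ...
    entry_points={
        "numba_extensions": [
            # "init" must resolve to a zero-argument callable that registers
            # the Arrow types, data models, and lowerings with Numba.
            "init = pyarrow._numba_ext:init",
        ],
    },
)
```

The corresponding "init" function would import Numba and perform the registrations; since it only runs inside an environment where Numba is installed and actively compiling, a plain "import pyarrow" stays Numba-free.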
