Hello,

Numba (https://numba.pydata.org/) is a Just-in-Time compiler for Python that speeds up scientific calculations written in Python. Out of the box, Numba supports NumPy arrays (which were the primary target of its design).

We (at QuantStack) have been investigating the feasibility of supporting a subset of PyArrow in Numba, so that the fast computation abilities of Numba can extend to data in the Arrow format.

We have come to the conclusion that supporting a small subset of PyArrow is definitely doable, at a competitive performance level (between "as fast as C++" and "4x slower" on a couple of preliminary micro-benchmarks).

(by "small subset" we mostly mean: primitive data types, reading and building arrays)

The Numba integration layer would ideally have to be maintained and distributed within PyArrow, because it needs access to a number of Arrow C++ APIs, which don't have a stable ABI (it *might* be possible to work around this by exporting a dedicated C-like ABI from PyArrow, though).

What we would like to know is how the community feels about putting this code inside PyArrow, rather than a separate package, for the reason given above.

This would *not* add a dependency on Numba, since the integration can be exposed through a dynamically-loaded extension point:
https://numba.readthedocs.io/en/stable/extending/entrypoints.html
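Concretely, the mechanism documented at that link lets a package register itself under the "numba_extensions" entry-point group; Numba then calls the registered "init" function the first time it is imported, so PyArrow would only pay the cost when Numba is actually present. A sketch of the packaging side (the package and module names here are purely illustrative):

```python
# setup.py for a hypothetical "pyarrow-numba" integration layer.
# The "numba_extensions" group and the "init" entry-point name are
# the mechanism documented by Numba; everything else is illustrative.
from setuptools import setup

setup(
    name="pyarrow-numba",
    packages=["pyarrow_numba"],
    entry_points={
        "numba_extensions": [
            # Numba invokes this callable on first import, giving the
            # package a chance to register Arrow types with the compiler.
            "init = pyarrow_numba:init_extension",
        ],
    },
)
```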

(note: this preliminary investigation was supported by one of our fine customers)

Regards

Antoine.
