Hello,
Numba (https://numba.pydata.org/) is a just-in-time compiler for Python
that can speed up numerical and scientific calculations written in
plain Python. Out of the box, Numba supports NumPy arrays (which were
the primary target of its design).
We (at QuantStack) have been investigating the feasibility of supporting
a subset of PyArrow in Numba, so that the fast computation abilities of
Numba can extend to data in the Arrow format.
We have come to the conclusion that supporting a small subset of PyArrow
is definitely doable, at a competitive performance level (between "as
fast as C++" and "4x slower" on a couple of preliminary micro-benchmarks).
(By "small subset" we mostly mean: primitive data types, and reading and
building arrays.)
The Numba integration layer would ideally be maintained and distributed
within PyArrow, because it needs access to a number of Arrow C++ APIs,
which don't have a stable ABI (it *might* be possible to work around
this by exporting a dedicated C-like ABI from PyArrow, though).
What we would like to know is how the community feels about putting this
code inside PyArrow, rather than in a separate package, for the reason
given above.
This would *not* add a dependency on Numba, since this can be exposed as
a dynamically-loaded extension point:
https://numba.readthedocs.io/en/stable/extending/entrypoints.html
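For reference, that entry point mechanism boils down to declaring a hook
in the package metadata; Numba discovers and calls the hook lazily, the
first time Numba itself is imported. A hypothetical sketch (the module
and function names below are placeholders, not an actual proposal):

```python
# setup.py (sketch): registering a Numba extension entry point.
# PyArrow would not import, or depend on, Numba: Numba imports the
# hook, not the other way around.
from setuptools import setup

setup(
    name="pyarrow",
    # ...
    entry_points={
        "numba_extensions": [
            # hypothetical module implementing the Arrow support
            "init = pyarrow._numba_extension:init",
        ],
    },
)
```

So users without Numba installed would see no behavior change at all.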
(note: this preliminary investigation was supported by one of our fine
customers)
Regards
Antoine.