I ran a quick-and-dirty experiment and got a segmentation fault, though I guess a lot depends on which of the things you use were inlined at compile time. I got past the segfault by using PyImport_ImportModule to check whether numpy exists, and ended up with a working minimal case where I could import pyarrow and create an array from a Python list. It's just a hack, far from a decent solution or test, but I think the most intertwined places are already behind a PyArray_Check, so we could use that as a guard to avoid executing numpy-related code. It looks like the majority of the work would actually be in Cython, where, by the way, dealing with unavailable imports is much more straightforward.
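
To make the guard idea concrete on the Python/Cython side, here is a minimal sketch. It is not what pyarrow does today: the names optional_import, is_numpy_array and convert are hypothetical, and the two "paths" are placeholders for the real conversion code. It only shows the shape of a PyArray_Check-style guard that returns False when numpy is missing.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def is_numpy_array(obj, np_module):
    """PyArray_Check-style guard: always False when numpy is unavailable."""
    return np_module is not None and isinstance(obj, np_module.ndarray)


def convert(obj, np_module=None):
    """Hypothetical dispatcher: take the numpy path only behind the guard."""
    if is_numpy_array(obj, np_module):
        # Placeholder for the numpy-aware conversion.
        return ("numpy-path", obj.tolist())
    # Placeholder for the simpler conversion that never touches numpy.
    return ("pure-python-path", list(obj))


# Resolve numpy once; None means it is not installed in this environment.
np = optional_import("numpy")
kind, values = convert([1, 2, 3], np)
```

With a plain Python list the guard is False whether or not numpy is installed, so the numpy branch is never reached. That is roughly the property the C++ side would need as well.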
There are also some interesting points, like the fact that the mask for a pyarrow array can only be a numpy array: how could I create a masked array without numpy? I guess that accepting Arrow arrays as masks is something we should allow anyway.

On Mon, Aug 16, 2021 at 6:53 PM Antoine Pitrou <anto...@python.org> wrote:

>
> I agree that "what happens when Numpy is not available at runtime" is a
> rather annoying problem. I'm not sure what happens when you call one
> of the Numpy C API functions and Numpy is not found (crash? error
> return?). It can probably be detected, but needs to be done
> consistently at the start of each PyArrow core function, which requires
> some care.
>
> At the end of the day, it looks like this would be a significant amount
> of work for a relatively minor benefit (did people complain about
> this?), so I'm not sure it's worth spending some time on it.
>
> Regards
>
> Antoine.
>
>
> On Mon, 16 Aug 2021 18:09:54 +0200
> Wes McKinney <wesmck...@gmail.com> wrote:
> > I've thought about this in the past, and I would like to make NumPy an
> > optional dependency, but one of the things that kept me from trying
> > was the extent to which NumPy arrays are supported as inputs (or
> > elements of inputs) to pyarrow.array. The implementation in
> > python_to_arrow.cc is significantly intertwined with NumPy's C API. It
> > might require maintaining two altogether different internal
> > implementations of pyarrow.array, a complicated one which deals with
> > all the NumPy oddities (including NumPy array scalars) and a much
> > simpler one that does not. pyarrow may have to detect at runtime
> > whether numpy is in sys.modules to decide whether to import and invoke
> > the more complicated function.
> >
> > On Mon, Aug 16, 2021 at 5:59 PM Alessandro Molina
> > <alessan...@ursacomputing.com> wrote:
> > >
> > > As Arrow/PyArrow grows more compute functions and features we might
> > > move toward a world where the number of users relying on PyArrow
> > > without going through Pandas or NumPy might grow.
> > >
> > > NumPy is a compile time dependency for PyArrow as it's required to
> > > compile the C++ code needed to implement the pandas/numpy
> > > integration, but there has been some discussion regard the fact that
> > > we could make NumPy optional at runtime (remove it from required
> > > dependencies in the Python distribution). You would have to install
> > > numpy only if you need to invoke to_numpy or to_pandas methods or
> > > similar integration features. For all the other use cases, that rely
> > > on Arrow alone, you would be able to pip install pyarrow without
> > > involving any other dependency and be ready to go.
> > >
> > > Technically it seems a bit complicated, Python/Cython can always work
> > > around missing libraries, but we would have to find ways to deal with
> > > lazy involvement of numpy from C++. I don't know if this is something
> > > that was already discussed in the past and thus someone already has
> > > solutions for this part of the problem, but before investing time and
> > > effort in research I think it made sense to make sure it's a goal
> > > that the development team agrees with.