h-vetinari opened a new issue, #38536: URL: https://github.com/apache/arrow/issues/38536
### Describe the enhancement requested

Providing slimmer variants of arrow has been a topic for quite a while, but it became more urgent with the pandas [plan](https://github.com/pandas-dev/pandas/pull/52711) to depend on pyarrow, which would bring a substantial increase in installation size due to the way pyarrow gets packaged. (This is even more so in conda-forge, where we package a "maximal" version of arrow -- since it's so hard to build from source -- that generally contains more in terms of transitive dependencies than the wheels.)

Through [work](https://github.com/conda-forge/arrow-cpp-feedstock/issues/1035) on the [feedstock](https://github.com/conda-forge/arrow-cpp-feedstock/pull/1175), the conda-forge side of arrow is now [ready](https://github.com/conda-forge/arrow-cpp-feedstock/pull/1201) to split up libarrow 14.0 into several pieces (currently `libarrow-{acero,dataset,flight,flight-sql,gandiva,substrait}` + `libparquet`), but we still have pyarrow depend on the entirety of libarrow, not least because the python bindings link directly to everything except `libarrow-flight-sql`:

```
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_acero.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libarrow_acero.so.1400 found in libarrow-acero-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_dataset.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libarrow_dataset.so.1400 found in libarrow-dataset-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_dataset_orc.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libarrow_dataset.so.1400 found in libarrow-dataset-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_dataset_parquet.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libarrow_dataset.so.1400 found in libarrow-dataset-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_flight.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libarrow_flight.so.1400 found in libarrow-flight-14.0.0-h35bba4a_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_substrait.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libarrow_substrait.so.1400 found in libarrow-substrait-14.0.0-hab2db56_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/gandiva.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libgandiva.so.1400 found in libarrow-gandiva-14.0.0-hacb8726_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/libarrow_python_flight.so): Needed DSO lib/libarrow_flight.so.1400 found in libarrow-flight-14.0.0-h35bba4a_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/libarrow_python.so): Needed DSO lib/libparquet.so.1400 found in libparquet-14.0.0-h352af49_0_cpu
```

While it would theoretically be possible to also build various `pyarrow-*` variants, that's quite unappealing IMO from a packaging perspective; it would be nicer if `pyarrow` just depended on the (core) `libarrow`, but provided helpful error messages wherever any missing `libarrow-*` libraries actually get used. In such a scenario (c.f. discussion in https://github.com/conda-forge/arrow-cpp-feedstock/issues/1035):

> @h-vetinari: [...] I think in terms of user-friendliness, we need to provide a better message than:
>
> ```
> ImportError: libarrow_dataset.so.1400: cannot open shared object file: No such file or directory
> ```
>
> I think it would make sense for arrow to define which libraries can be removed while still expecting core functionality to work, which dependencies each remaining artefact has, and provide _some_ error message upon not finding the respective library (which we can then patch on the feedstock to add messages like "install this additional package to get it"; alternatively, arrow could of course integrate that into the messages directly in this repo, à la "if you're using arrow from conda-forge, just install `libarrow-dataset`").

Such an approach would presumably also make things easier on the wheel side (i.e.
not having N `pyarrow-*` variants), though of course, providing the equivalent of the `libarrow-*` outputs from conda-forge through wheels would be quite a headache. It's possible that the best solution for wheels looks different (or ends up being sliced differently, e.g. two wheels `pyarrow` and `pyarrow-minimal`, or `pyarrow` and `pyarrow[full]`).

Note also that (from https://github.com/conda-forge/arrow-cpp-feedstock/pull/1175):

> @h-vetinari: [...] the new `libarrow` core library still depends on some of the most heavy-weight libraries at runtime (e.g. `libgoogle-cloud`, which is around ~30MB). I think it would make sense to separate out the pieces that depend on cloud-provider bindings into a separate output. Not sure how much work that is...

This is now being tracked in https://github.com/apache/arrow/issues/38309.

### Component(s)

Packaging, Python
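The "helpful error message" idea could look roughly like the following. This is a hypothetical sketch, not pyarrow's actual code: the `_OPTIONAL_COMPONENTS` mapping, the `import_optional` helper, and the conda package names it suggests are all illustrative assumptions about how a core-only `pyarrow` might defer its optional extension modules.

```python
import importlib

# Illustrative mapping (assumption, not an existing pyarrow API) from an
# optional pyarrow extension module to the conda-forge package that would
# ship the corresponding split-out libarrow component.
_OPTIONAL_COMPONENTS = {
    "pyarrow._dataset": "libarrow-dataset",
    "pyarrow._flight": "libarrow-flight",
    "pyarrow._substrait": "libarrow-substrait",
    "pyarrow.gandiva": "libarrow-gandiva",
}


def import_optional(module_name):
    """Import an optional extension module, turning a failed shared-library
    load into an actionable hint about which package provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        package = _OPTIONAL_COMPONENTS.get(module_name)
        if package is None:
            # Not a known optional component: re-raise the original error.
            raise
        raise ImportError(
            f"{module_name} needs the '{package}' component, which is not "
            f"installed. If you are using arrow from conda-forge, run: "
            f"conda install -c conda-forge {package}"
        ) from exc
```

A wrapper like this (e.g. called from a module-level `__getattr__` in `pyarrow/__init__.py`) would also give the feedstock a single place to patch in distribution-specific install hints, as suggested in the quoted comment above.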
