Thanks Raúl for taking care to make this minimally disruptive. This might be an inconvenience for some PyArrow users, but I think the benefits outweigh the cost.
Ian

On Tue, Apr 9, 2024 at 11:17 AM Raúl Cumplido <rau...@apache.org> wrote:
> Hi,
>
> As part of the effort to reduce the footprint of pyarrow
> installations, we have been working on splitting pyarrow into separate
> packages for conda [1]. Each package will pull different C++
> dependencies, which will provide different capabilities.
>
> This PR [1] will provide 3 packages for pyarrow:
> pyarrow-core < pyarrow < pyarrow-all
>
> - pyarrow-core: will pull the libarrow.so (~40MB) dependency.
> - pyarrow: in addition to libarrow.so, will also pull the libarrow_acero,
>   libarrow_dataset, libarrow_substrait and libparquet (~78MB)
>   dependencies.
> - pyarrow-all: in addition to everything in pyarrow, will also pull
>   libarrow_flight, libarrow_flight_sql and libarrow_gandiva (~97MB).
>
> This means that if you are using conda and installing pyarrow today
> with 16.0.0, you will see a reduction in the size of the C++ dependencies,
> and you will not have access to flight, flight_sql or gandiva. If you want
> to keep using those, you will have to install pyarrow-all.
>
> If you want a minimal pyarrow installation without access to acero,
> dataset, parquet or substrait, you can use pyarrow-core and get a
> further reduction in size. Bear in mind that the Arrow team is working on
> moving the filesystems out of libarrow, and those will be pulled out of
> pyarrow-core in the future. This means that, probably, in 17.0.0
> pyarrow-core will not support the S3, GCS or Azure filesystems.
>
> The idea is to keep working on these efforts to further reduce the size
> of pyarrow.
>
> Thanks everyone,
> Raúl
>
> [1] https://github.com/conda-forge/arrow-cpp-feedstock/pull/1255
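
For anyone who wants to check which optional pieces a given install actually
provides, here is a rough sketch. It assumes the conda-forge package names
above and that importing a submodule whose C++ dependency is missing raises
ImportError; the exact error may differ depending on how the split packages
are built.

    # Install one of the three variants first, e.g. (names per the feedstock PR):
    #   conda install -c conda-forge pyarrow-core   # minimal, libarrow.so only
    #   conda install -c conda-forge pyarrow        # + acero, dataset, parquet, substrait
    #   conda install -c conda-forge pyarrow-all    # + flight, flight_sql, gandiva
    import importlib

    optional = [
        "pyarrow.acero",
        "pyarrow.dataset",
        "pyarrow.parquet",
        "pyarrow.substrait",
        "pyarrow.flight",
        "pyarrow.gandiva",
    ]
    for mod in optional:
        try:
            importlib.import_module(mod)
            print(f"{mod}: available")
        except ImportError:
            # Raised when the corresponding C++ library was not installed.
            print(f"{mod}: not available in this install")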