Apologies for being very late to this discussion, but if anyone is still interested in this work, I attempted something like this quite a while ago at https://github.com/westonsteimel/pyarrow-parquet. I eventually gave up on that approach (because of the build times, among other things) and instead moved to taking the published wheels and stripping them down to only what I wanted, at https://github.com/westonsteimel/pyarrow-slim. I haven't updated that in quite some time, but perhaps it can serve as a useful starting point.
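
As a rough illustration of the "strip down a published wheel" idea, here is a minimal sketch. It is not the code used in pyarrow-slim; the function name `strip_wheel` and the default prefix `pyarrow/tests/` are assumptions chosen for the example. It relies only on the fact that a wheel is a zip archive whose `*.dist-info/RECORD` file lists one row per installed file.

```python
import zipfile


def strip_wheel(src_path, dst_path, drop_prefixes=("pyarrow/tests/",)):
    """Copy a wheel to dst_path, skipping entries under drop_prefixes.

    Hypothetical helper for illustration only. Returns the dropped names.
    """
    prefixes = tuple(drop_prefixes)
    dropped = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename.startswith(prefixes):
                dropped.append(info.filename)
                continue
            data = src.read(info.filename)
            if info.filename.endswith(".dist-info/RECORD"):
                # RECORD holds "path,hash,size" rows; drop the rows of removed
                # files so the listing still matches the archive contents.
                # Assumption: the dropped paths contain no commas, so their
                # rows are not CSV-quoted.
                lines = data.decode("utf-8").splitlines(keepends=True)
                kept = [ln for ln in lines if not ln.startswith(prefixes)]
                data = "".join(kept).encode("utf-8")
            # Reusing the original ZipInfo keeps names, timestamps and modes.
            dst.writestr(info, data)
    return dropped


# Example call (file names are placeholders, not real artifacts):
# strip_wheel("pyarrow-original.whl", "pyarrow-stripped.whl")
```

Dropping compiled components (for example Flight or the filesystem libraries) is harder than dropping tests, because the remaining extension modules may still link against them, so a prefix filter like this is only safe for files nothing else depends on.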

Thanks,
--Weston Steimel

On Mon, 10 Oct 2022, 13:08 Wes McKinney, <wesmck...@gmail.com> wrote:

> We've discussed this in the past, I think. In addition to having many
> optional components enabled, the pyarrow wheel also includes the unit
> tests directory which is of growing size. I think if we made a
> pyarrow-slim wheel with support only for core Arrow (IPC, etc.) and
> Parquet file reading, it might be possible to trim by significant
> percentage.
>
> Rusty -- if you would like to push this forward I would suggest
> creating an alternative wheel build script to the one that we use and
> modify flags / add other customizations (e.g. trimming unit tests)
> that produce a wheel that we could build and possibly upload as
> "pyarrow-slim" on PyPI
>
> On Mon, Oct 3, 2022 at 8:55 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Hi Rusty,
> >
> > On 02/10/2022 at 22:51, Rusty Conover wrote:
> > > Hi Arrow Team,
> > >
> > > I'm using Apache Arrow with AWS Lambda Functions.
> > >
> > > The primary motivation is AWS Athena's user-defined functions[1]. Those
> > > functions process and return Arrow IPC segments.
> > >
> > > * The published Python wheels for Apache Arrow include almost every feature
> > > of Arrow. (Gandiva, Plasma, Flight)
> >
> > Gandiva isn't compiled in the Python wheels. Plasma is reasonably small
> > (but is also being deprecated soon). Flight is more sizable. However,
> > most of the size seems to be in Arrow itself and Parquet. A large part
> > of the size is probably attributable to the Arrow compute engine and
> > functions, and also perhaps to filesystem implementations such as S3 and
> > GCS (due to the large third-party dependencies that they bundle).
> >
> > > Would it be possible to create a new Python package (i.e., "pyarrow-slim")
> > > that would disable some of the functionality but result in smaller python
> > > wheels?
> >
> > Perhaps. The first step would be to allow disabling more components in
> > PyArrow, though. Otherwise I'm afraid the size reduction wouldn't be
> > terrific.
> >
> > Regards
> >
> > Antoine.