kylebarron opened a new issue, #1227: URL: https://github.com/apache/datafusion-python/issues/1227
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

PyArrow is a massive dependency. Unpacked, it tends to be >100MB in size and, until the latest versions (I think?), it also required numpy as a non-optional dependency of its own. It's also, in effect, the only current dependency: https://github.com/apache/datafusion-python/blob/f0bbad7543717c5f08ba2acb92d42c9d30fd2355/pyproject.toml#L46

It would be great if we could remove it, which would greatly reduce the minimal environment size for datafusion-python. [Many other Python Arrow libraries](https://github.com/apache/arrow/issues/39195#issuecomment-2245718008) implement the PyCapsule Interface, so the user can use nanoarrow, arro3, Polars, DuckDB, etc., or pyarrow, whichever is best for them.

**Describe the solution you'd like**

The Arrow PyCapsule Interface is a lightweight, decentralized protocol for sharing Arrow data between Python libraries. We already implement the PyCapsule Interface, so it's just a matter of removing places where we hard-code use of pyarrow.

**Describe alternatives you've considered**

Keep the pyarrow dependency.

**Additional context**
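
As a rough illustration of what replacing a hard-coded pyarrow check might look like, here is a minimal sketch that accepts input by protocol rather than by type. Only the dunder method names (`__arrow_c_stream__`, `__arrow_c_array__`, `__arrow_c_schema__`) come from the Arrow PyCapsule Interface; the helpers `arrow_capsule_method` and `require_arrow_stream` and the `FakeProducer` class are hypothetical names made up for this example and are not part of datafusion-python.

```python
from typing import Any, Optional

# Dunder methods defined by the Arrow PyCapsule Interface.
_CAPSULE_METHODS = ("__arrow_c_stream__", "__arrow_c_array__", "__arrow_c_schema__")


def arrow_capsule_method(obj: Any) -> Optional[str]:
    """Return the first PyCapsule Interface method that obj exposes, else None."""
    for name in _CAPSULE_METHODS:
        if callable(getattr(obj, name, None)):
            return name
    return None


def require_arrow_stream(obj: Any) -> Any:
    """Hypothetical helper: validate by protocol instead of isinstance(obj, pa.Table)."""
    if not callable(getattr(obj, "__arrow_c_stream__", None)):
        raise TypeError(
            f"{type(obj).__name__} does not implement __arrow_c_stream__"
        )
    return obj


class FakeProducer:
    """Stand-in for a pyarrow, nanoarrow, arro3, or Polars object (no real capsule)."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")
```

With a check like this, no `import pyarrow` is needed at the call site: any object from a library that implements the protocol passes, and anything else is rejected with a `TypeError`.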
