rando-brando commented on issue #33759: URL: https://github.com/apache/arrow/issues/33759#issuecomment-1413018420
I want to second this issue, as I am having the same problem. In my case it stems from the Python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python), which uses the Arrow format. We use `deltalake` to read from Delta with Arrow because Spark is less performant in many cases. However, when calling `dataset.to_batches()`, all available memory appears to be consumed quickly even though the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it is not clear what I can do to resolve this. Any suggested workarounds would be much appreciated. We are using `pyarrow==10.0.1` and `deltalake==0.6.3`.
