rando-brando commented on issue #33759: URL: https://github.com/apache/arrow/issues/33759#issuecomment-1413018420
I want to second this issue, as I am having the same problem. In my case it stems from the Python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python), which uses the Arrow format. We use `deltalake` to read from Delta with Arrow because Spark is less performant in many cases. However, when calling `dataset.to_batches()`, all available memory appears to be consumed quickly even though the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it is not clear what I can do to resolve this. Any suggested workarounds would be much appreciated. We are using `pyarrow==10.0.1` and `deltalake==0.6.3`.
