rando-brando commented on issue #33759: URL: https://github.com/apache/arrow/issues/33759#issuecomment-1414623845
> > I wanted to second this issue as I am having the same problem. In my case the problem stems from the Python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python), which uses the Arrow format. We use deltalake to read from Delta with Arrow because Spark is less performant in many cases. However, when trying `dataset.to_batches()` it appears that all available memory is quickly consumed even if the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it's not clear what I can do to resolve the issue in its current state. Any suggestions or workarounds would be much appreciated. We are using pyarrow==10.0.1 and deltalake==0.6.3.
>
> Do you also have many files with large amounts of metadata? If you do not, then I suspect it is unrelated to this issue. I'd like to avoid umbrella issues of "sometimes some queries use more RAM than expected".
>
> #33624 is (as much as I can tell) referring to I/O bandwidth and not total RAM usage, so it also sounds like a different situation. Perhaps you can open your own issue with some details about the dataset you are trying to read (how many files? What RAM consumption are you expecting? What RAM consumption are you seeing?)

My issue is that when I use `to_batches()`, even on small datasets (sub 1 GB), my free memory is quickly consumed, which often results in an OOM error. Based on the issue title and the description by the OP, I thought the issue was similar or perhaps the same and did not require a new issue. However, I can open a new one if you find it appropriate.
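
Roughly, the read pattern looks like the following minimal sketch. The table path and `batch_size` are placeholders, and it assumes the table is opened through deltalake's standard `DeltaTable.to_pyarrow_dataset()` entry point:

```python
from deltalake import DeltaTable

# Hypothetical table location; the real table is sub 1 GB in this case.
table = DeltaTable("/data/my_delta_table")

# Expose the Delta table as a pyarrow.dataset.Dataset.
dataset = table.to_pyarrow_dataset()

# Stream the data in record batches; batch_size here is illustrative.
# It is during this loop that free memory is quickly consumed.
total_rows = 0
for batch in dataset.to_batches(batch_size=64_000):
    total_rows += batch.num_rows

print(total_rows)
```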
