rando-brando commented on issue #33759: URL: https://github.com/apache/arrow/issues/33759#issuecomment-1414623845
> > I wanted to second this issue as I am having the same problem. In my case the problem stems from the Python package [deltalake](https://github.com/delta-io/delta-rs/tree/main/python), which uses the Arrow format. We use deltalake to read from Delta with Arrow because Spark is less performant in many cases. However, when trying `dataset.to_batches()` it appears that all available memory is quickly consumed even if the dataset is not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation and it's not clear what I can do to resolve the issue in its current state. Any suggestions or workarounds would be much appreciated. We are using pyarrow==10.0.1 and deltalake==0.6.3.
>
> Do you also have many files with large amounts of metadata? If you do not, then I suspect it is unrelated to this issue. I'd like to avoid umbrella issues of "sometimes some queries use more RAM than expected".
>
> #33624 is (as much as I can tell) referring to I/O bandwidth and not total RAM usage, so it also sounds like a different situation. Perhaps you can open your own issue with some details about the dataset you are trying to read (how many files? What RAM consumption are you expecting? What RAM consumption are you seeing?)

My issue is that when I use `to_batches()`, even on small datasets (sub 1 GB), my free memory is quickly consumed, which often results in an OOM error. Based on the issue title and the description by the OP, I thought the issue was similar or perhaps the same and did not require a new issue. However, I can open a new one if you find it appropriate.
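
Roughly, the read pattern looks like the following minimal sketch. The table path and `batch_size` are placeholders, and it assumes the table is opened through deltalake's standard `DeltaTable.to_pyarrow_dataset()` entry point:

```python
from deltalake import DeltaTable

# Hypothetical table location; the real table is sub 1 GB in this case.
table = DeltaTable("/data/my_delta_table")

# Expose the Delta table as a pyarrow.dataset.Dataset.
dataset = table.to_pyarrow_dataset()

# Stream the data in record batches; batch_size here is illustrative.
# It is during this loop that free memory is quickly consumed.
total_rows = 0
for batch in dataset.to_batches(batch_size=64_000):
    total_rows += batch.num_rows

print(total_rows)
```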
