rando-brando commented on issue #33759:
URL: https://github.com/apache/arrow/issues/33759#issuecomment-1415832545

   > > I wanted to second this issue as I am having the same problem. In my 
case the problem stems from the python package 
[deltalake](https://github.com/delta-io/delta-rs/tree/main/python) which uses 
the arrow format. We use deltalake to read from Delta with arrow because Spark 
is less performant in many cases. However, when trying dataset.to_batches() it 
appears that all available memory is quickly consumed even if the dataset is 
not very large (e.g. 100M rows x 50 cols). I have reviewed the documentation, 
and it's not clear what I can do to resolve the issue in its current state. Any 
suggestions or workarounds would be much appreciated. We are using 
pyarrow==10.0.1 and deltalake==0.6.3.
   > 
   > Do you also have many files with large amounts of metadata? If you do not 
then I suspect it is unrelated to this issue. I'd like to avoid umbrella issues 
of "sometimes some queries use more RAM than expected".
   > 
   > #33624 is (as much as I can tell) referring to I/O bandwidth and not total 
RAM usage. So it also sounds like a different situation. Perhaps you can open 
your own issue with some details about the dataset you are trying to read (how 
many files? What RAM consumption are you expecting? What RAM consumption are 
you seeing?)
   
   I thought it was likely related because both issues occur when using 
`to_batches()` on small data, the only difference being that I am reading 
directly from a mounted disk while the OP is reading over the network. If the 
scanner is the cause, as some comments have suggested, a fix would resolve 
both of our issues.
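
   For context, the access pattern described above is roughly the following 
(a minimal sketch only; the table path, column names, and batch size are 
illustrative placeholders, not details from the report):

```python
# Sketch of the pattern discussed in this issue: open a Delta table with the
# deltalake package and stream it through pyarrow's Dataset.to_batches().
# The table URI, columns, and batch_size below are hypothetical placeholders.
from deltalake import DeltaTable

dt = DeltaTable("/mnt/data/my_delta_table")  # hypothetical mounted-disk path
dataset = dt.to_pyarrow_dataset()            # expose the table as a pyarrow dataset

rows = 0
for batch in dataset.to_batches(columns=["id", "value"], batch_size=64_000):
    # Each batch is a pyarrow.RecordBatch; in principle only one batch should
    # need to be materialized at a time, which is why the memory growth
    # reported here is unexpected.
    rows += batch.num_rows

print(f"read {rows} rows")
```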

