VHellendoorn commented on issue #12653: URL: https://github.com/apache/arrow/issues/12653#issuecomment-1159856484
I am noticing the same issue with pyarrow 8.0.0: memory usage steadily climbs past 10GB while reading batches from a 15GB Parquet file, even with a batch size of 1. The rows in this dataset vary a fair bit in size, but not enough to require that much RAM.

For what it's worth, passing `use_threads=False` to `scanner` keeps the memory footprint much smaller (it stays under ~3GB in this case, though it still fluctuates quite a bit). I noticed that this implicitly disables both batch and fragment readahead [here](https://github.com/apache/arrow/blob/78fb2edd30b602bd54702896fa78d36ec6fefc8c/cpp/src/arrow/dataset/scanner.h#L90). The performance penalty isn't particularly large, especially with bigger batch sizes, so this may be a temporary solution for those wishing to keep memory usage low.
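For reference, here is a minimal sketch of the workaround (the file path and batch size are placeholders, and the per-batch work is just a row count for illustration):

```python
import pyarrow.dataset as ds

# "data.parquet" is a placeholder path for illustration.
dataset = ds.dataset("data.parquet", format="parquet")

# use_threads=False implicitly disables batch and fragment readahead,
# trading some throughput for a bounded memory footprint.
scanner = dataset.scanner(batch_size=1024, use_threads=False)

total_rows = 0
for batch in scanner.to_batches():
    total_rows += batch.num_rows  # replace with actual per-batch processing

print(f"Read {total_rows} rows")
```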
