VHellendoorn commented on issue #12653: URL: https://github.com/apache/arrow/issues/12653#issuecomment-1159856484
I am noticing the same issue with pyarrow 8.0.0: memory usage steadily climbs past 10GB while reading batches from a 15GB Parquet file, even with a batch size of 1. The rows in this dataset vary a fair bit in size, but not enough to require that much RAM.

For what it's worth, passing `use_threads=False` to `scanner` keeps the memory footprint much smaller (it stays under ~3GB in this case, though it still fluctuates quite a bit). I noticed that this implicitly disables both batch and fragment readahead [here](https://github.com/apache/arrow/blob/78fb2edd30b602bd54702896fa78d36ec6fefc8c/cpp/src/arrow/dataset/scanner.h#L90). The performance penalty isn't particularly large, especially with bigger batch sizes, so this may be a temporary solution for those wishing to keep memory usage low.
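For reference, here is a minimal sketch of the workaround (the file path and batch size are placeholders, and the per-batch work is just a row count for illustration):

```python
import pyarrow.dataset as ds

# "data.parquet" is a placeholder path for illustration.
dataset = ds.dataset("data.parquet", format="parquet")

# use_threads=False implicitly disables batch and fragment readahead,
# trading some throughput for a bounded memory footprint.
scanner = dataset.scanner(batch_size=1024, use_threads=False)

total_rows = 0
for batch in scanner.to_batches():
    total_rows += batch.num_rows  # replace with actual per-batch processing

print(f"Read {total_rows} rows")
```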
