westonpace commented on issue #12653: URL: https://github.com/apache/arrow/issues/12653#issuecomment-1071923418
At the moment we generally use too much memory when scanning parquet. This is because the scanner's readahead is unfortunately based on the row group size and not the batch size. Using smaller row groups in your source files will help. #12228 changes the readahead to be based on the batch size but it's been on my back burner for a bit. I'm still optimistic I will get to it for the 8.0.0 release. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
