crepererum opened a new issue, #2321: URL: https://github.com/apache/arrow-rs/issues/2321
**Describe the bug**

The `batch_size` passed to `ParquetFileArrowReader::get_record_reader[_by_columns]` results in allocating buffers for that many records even when the file contains less data. This is unfortunate (and dangerous) because the parameter is really hard to estimate: in a system that reads and writes parquet files you may know that the files it writes only contain a reasonable amount of data (in bytes), but not how many rows they hold. Inspecting the parquet file and its file-level metadata also suggests it is fine to read everything at once, so you optimistically pass a very high `batch_size`... and OOM your process.

**To Reproduce**

No isolated code yet, but it roughly goes as follows (see the sketch at the end of this report):

1. Create a reasonably sized parquet file with a single record batch and some columns.
2. Use a really large `batch_size` when reading the file.

**Expected behavior**

The row counts are known at least in the file-level parquet metadata (and probably in other places), so they should be applied as a limit before allocating buffers.

**Additional context**

Occurs with arrow version 19.
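A rough, untested sketch of the reproduction steps above. The file path, column name, row values, and the concrete `batch_size` are arbitrary assumptions; the API calls follow the arrow/parquet 19 crates as I understand them, not a verified minimal example:

```rust
use std::{fs::File, sync::Arc};

use arrow::{
    array::{ArrayRef, Int32Array},
    record_batch::RecordBatch,
};
use parquet::{
    arrow::{arrow_writer::ArrowWriter, ArrowReader, ParquetFileArrowReader},
    file::reader::{FileReader, SerializedFileReader},
};

fn main() {
    // NOTE: path and column name are placeholders for this sketch.
    let path = "/tmp/small.parquet";

    // 1. Write a reasonably small file: a single record batch with one column.
    let batch = RecordBatch::try_from_iter([(
        "c0",
        Arc::new(Int32Array::from(vec![1, 2, 3, 4])) as ArrayRef,
    )])
    .unwrap();
    let mut writer =
        ArrowWriter::try_new(File::create(path).unwrap(), batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // The file-level metadata already knows the true row count ...
    let file_reader = SerializedFileReader::new(File::open(path).unwrap()).unwrap();
    println!(
        "rows according to file metadata: {}",
        file_reader.metadata().file_metadata().num_rows()
    );

    // 2. ... but reading with a huge `batch_size` still sizes the read buffers
    //    for `batch_size` records, not for the rows actually present,
    //    which can OOM the process.
    let mut arrow_reader = ParquetFileArrowReader::try_new(File::open(path).unwrap()).unwrap();
    let mut record_reader = arrow_reader.get_record_reader(1_000_000_000).unwrap();
    let batch = record_reader.next().unwrap().unwrap();
    println!("rows actually read: {}", batch.num_rows());
}
```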
