crepererum opened a new issue, #2321:
URL: https://github.com/apache/arrow-rs/issues/2321

   **Describe the bug**
   The `batch_size` passed to 
`ParquetFileArrowReader::get_record_reader[_by_columns]` results in buffers 
being allocated for that many records even when the file contains far fewer 
rows. This is unfortunate (and dangerous) because the parameter is hard to 
estimate: in a system that reads and writes parquet files, you may assume the 
files written only contain a reasonable amount of data (in bytes), but you 
don't know how many rows they hold. Even inspecting the parquet file and its 
file-level metadata suggests it is fine to read everything at once, so you 
optimistically pass a very high `batch_size`... and OOM your process.
   
   **To Reproduce**
   No isolated code yet, but it roughly goes as follows (a rough sketch is 
included after the steps):
   
   1. create a reasonably sized parquet file with a single record batch and 
some columns
   2. use a really large `batch_size` when reading the file.
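   
   A minimal end-to-end sketch, assuming the arrow/parquet 19 API; the file 
path, the 1 000-row column, and the `1 << 30` batch size are placeholder 
values, not from a confirmed reproducer:
   
   ```rust
   use std::fs::File;
   use std::sync::Arc;
   
   use arrow::array::Int64Array;
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
       // 1. Write a small file: a single record batch with one column and 1k rows.
       let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
       let batch = RecordBatch::try_new(
           schema.clone(),
           vec![Arc::new(Int64Array::from_iter_values(0..1_000))],
       )?;
       let mut writer = ArrowWriter::try_new(File::create("/tmp/small.parquet")?, schema, None)?;
       writer.write(&batch)?;
       writer.close()?;
   
       // 2. Read it back with a huge `batch_size`: buffers are sized for
       //    `batch_size` rows up front, not for the 1_000 rows actually present.
       let mut reader = ParquetFileArrowReader::try_new(File::open("/tmp/small.parquet")?)?;
       for batch in reader.get_record_reader(1 << 30)? {
           println!("read {} rows", batch?.num_rows());
       }
       Ok(())
   }
   ```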
   
   **Expected behavior**
   The row counts are known at least in the file-level parquet metadata (and 
probably in other places as well), so they should be applied as a limit on 
`batch_size` before allocating buffers. A caller-side sketch of that clamping 
follows below.
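   
   As a stop-gap, a caller can do the same clamping by hand before building 
the record reader. A rough sketch, assuming the parquet 19 API; the 
`clamped_batch_size` helper and its `path` argument are hypothetical:
   
   ```rust
   use std::fs::File;
   
   use parquet::file::reader::{FileReader, SerializedFileReader};
   
   /// Hypothetical helper: clamp a requested `batch_size` to the number of rows
   /// recorded in the file-level parquet metadata, so buffers are never sized
   /// for more rows than the file can possibly yield.
   fn clamped_batch_size(path: &str, requested: usize) -> Result<usize, Box<dyn std::error::Error>> {
       let reader = SerializedFileReader::new(File::open(path)?)?;
       let num_rows = reader.metadata().file_metadata().num_rows() as usize;
       // Keep at least 1 so the record reader still makes progress on empty files.
       Ok(requested.min(num_rows.max(1)))
   }
   ```
   
   The clamped value would then be passed to `get_record_reader` in place of 
the raw guess.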
   
   **Additional context**
   Occurs with arrow version 19.

