[I] `ParquetRecordBatchStream` API to fetch the next row group while decoding [arrow-rs]

via GitHub Mon, 14 Oct 2024 12:05:10 -0700


masonh22 opened a new issue, #6559:
URL: https://github.com/apache/arrow-rs/issues/6559


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I've noticed low CPU utilization when reading from filesystems with low
   bandwidth using a `ParquetRecordBatchStream`. This appears to be caused by
   the fact that the stream fetches row group data on demand rather than ahead
   of time. In my specific scenario, I'm reading a Parquet file from S3 with
   four 128MB row groups. Each row group takes ~2 seconds to fetch and ~500ms
   to decode. In all, it takes around 10 seconds to read and decode the entire
   file.
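
The reported figures can be checked with a small timing model. This is only a sketch: the function names are hypothetical, and it assumes every row group costs the same to fetch and decode and that nothing overlaps, as described above.

```rust
/// Wall-clock time (ms) when each row group is fetched and then decoded,
/// strictly one after the other (the on-demand behavior described above).
fn on_demand_ms(row_groups: u64, fetch_ms: u64, decode_ms: u64) -> u64 {
    row_groups * (fetch_ms + decode_ms)
}

/// Percentage of that wall-clock time spent decoding, i.e. with the CPU busy.
/// Uses integer division, so the result is rounded down.
fn decode_share_percent(row_groups: u64, fetch_ms: u64, decode_ms: u64) -> u64 {
    let total = on_demand_ms(row_groups, fetch_ms, decode_ms);
    if total == 0 {
        return 0;
    }
    100 * row_groups * decode_ms / total
}
```

With four row groups, a 2000 ms fetch, and a 500 ms decode, `on_demand_ms` gives 10000 ms, matching the observed total, and `decode_share_percent` gives 20, which is consistent with the low CPU utilization.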
   
   **Describe the solution you'd like**
   I'd like to add the option for `ParquetRecordBatchStream` to fetch the data
   for the next row group while decoding data for the current row group.
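
The overlap being requested can be illustrated with a minimal sketch that uses only the Rust standard library. It is not the `parquet` crate's API: `fetch_row_group`, `decode_row_group`, and `read_with_prefetch` are hypothetical stand-ins, and a real implementation would presumably be async rather than thread-based. Under the figures above, and assuming perfect overlap, the total would be roughly 2 + 3 × 2 + 0.5 = 8.5 seconds instead of 10.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for fetching one row group's bytes from storage.
fn fetch_row_group(index: usize) -> Vec<u8> {
    vec![index as u8; 4]
}

/// Hypothetical stand-in for decoding a fetched row group.
/// Here it just sums the bytes so the result depends on the input.
fn decode_row_group(bytes: &[u8]) -> usize {
    bytes.iter().map(|b| *b as usize).sum()
}

/// Decodes row groups in order while a background thread fetches the next one.
fn read_with_prefetch(num_row_groups: usize) -> Vec<usize> {
    // A zero-capacity channel is a rendezvous point: the fetch thread can
    // finish fetching row group i + 1 while row group i is being decoded,
    // but it then blocks in `send` until the decoder is ready for it.
    // This keeps at most one fetched row group waiting ahead of the decoder.
    let (tx, rx) = sync_channel::<Vec<u8>>(0);

    let fetcher = thread::spawn(move || {
        for index in 0..num_row_groups {
            let bytes = fetch_row_group(index);
            // Stop early if the decoding side has gone away.
            if tx.send(bytes).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which ends the receiving loop below.
    });

    let mut decoded = Vec::with_capacity(num_row_groups);
    for bytes in rx {
        decoded.push(decode_row_group(&bytes));
    }

    fetcher.join().expect("fetch thread panicked");
    decoded
}
```

Calling `read_with_prefetch(4)` returns the decoded results in row group order. The zero-capacity channel bounds memory to one prefetched row group, which matters when each row group is around 128MB.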
   
   **Describe alternatives you've considered**
   
   
   **Additional context**
   
   


