mapleFU commented on issue #36765:
URL: https://github.com/apache/arrow/issues/36765#issuecomment-1735491961
```
Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
                                            const std::vector<int>& column_indices,
                                            std::unique_ptr<RecordBatchReader>* out) {
  RETURN_NOT_OK(BoundsCheck(row_groups, column_indices));

  if (reader_properties_.pre_buffer()) {
    // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    reader_->PreBuffer(row_groups, column_indices, reader_properties_.io_context(),
                       reader_properties_.cache_options());
    END_PARQUET_CATCH_EXCEPTIONS
  }
```
Here, `PreBuffer` will try to buffer the required row groups/column chunks, and that memory will not be released until the read is finished. This is different from buffering mode (actually, buffering mode might decrease memory usage, lol). Even when the cache policy is lazy, the reader might not get faster if the row group is large enough, and the memory still isn't released before the read finishes. So I wonder if this is ok.
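For context, here is a minimal sketch (not from the issue) of how a caller would end up on this code path: enabling `pre_buffer` together with the lazy cache policy on `ArrowReaderProperties`. The file path, row group, and column indices are placeholders.

```
#include <memory>
#include <string>
#include <vector>

#include "arrow/io/caching.h"
#include "arrow/io/file.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "parquet/arrow/reader.h"
#include "parquet/properties.h"

arrow::Status ReadWithPreBuffer(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));

  parquet::ArrowReaderProperties arrow_props;
  // This flag is what triggers the reader_->PreBuffer(...) call quoted above.
  arrow_props.set_pre_buffer(true);
  // Lazy policy defers the actual reads, but the coalesced ranges for the
  // requested row groups are still cached until the read finishes.
  arrow_props.set_cache_options(arrow::io::CacheOptions::LazyDefaults());

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(std::move(file)));
  builder.properties(arrow_props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  // Read only row group 0, columns 0 and 1; these become the row_groups /
  // column_indices that PreBuffer() receives.
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(
      /*row_groups=*/{0}, /*column_indices=*/{0, 1}, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
  }
  return arrow::Status::OK();
}
```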