[GitHub] [arrow] mapleFU commented on pull request #37854: GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files

via GitHub Tue, 26 Sep 2023 03:52:44 -0700


mapleFU commented on PR #37854:
URL: https://github.com/apache/arrow/pull/37854#issuecomment-1735294931


   ```c++
   Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
                                               const std::vector<int>& column_indices,
                                               std::unique_ptr<RecordBatchReader>* out) {
     RETURN_NOT_OK(BoundsCheck(row_groups, column_indices));

     if (reader_properties_.pre_buffer()) {
       // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled
       BEGIN_PARQUET_CATCH_EXCEPTIONS
       reader_->PreBuffer(row_groups, column_indices, reader_properties_.io_context(),
                          reader_properties_.cache_options());
       END_PARQUET_CATCH_EXCEPTIONS
     }
     // ... (remainder of the function omitted)
   ```
   
   Here, `PreBuffer` will try to buffer the required RowGroups if necessary, and that memory will not be released until the read is finished. This is different from `buffering` mode (buffering mode might actually decrease memory usage, lol).

   Even when the cache policy is `lazy`, the reader might not get any faster if the RowGroup is large enough, and the memory still will not be released before the read is finished. So I wonder if this is OK.
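
   To make the memory concern concrete, here is a rough toy model (not Arrow code). The function names and byte sizes are made up for illustration. It assumes that pre-buffered data for every requested RowGroup stays cached until the read finishes, as described above, and compares that with a reader that holds one RowGroup at a time:

   ```c++
   #include <algorithm>
   #include <cstdint>
   #include <numeric>
   #include <vector>

   // Toy model only. Each entry is the hypothetical number of bytes needed
   // for the selected column chunks of one RowGroup.

   // Assumption: with pre-buffering, every requested range stays cached until
   // the read is finished, so the peak is the sum of all RowGroup sizes.
   int64_t PeakWithPreBuffer(const std::vector<int64_t>& row_group_bytes) {
     return std::accumulate(row_group_bytes.begin(), row_group_bytes.end(),
                            int64_t{0});
   }

   // Assumption: a reader that loads one RowGroup, consumes it, and releases it
   // before loading the next one peaks at the largest single RowGroup.
   int64_t PeakOneAtATime(const std::vector<int64_t>& row_group_bytes) {
     if (row_group_bytes.empty()) return 0;
     return *std::max_element(row_group_bytes.begin(), row_group_bytes.end());
   }
   ```

   With RowGroups of 100, 300 and 200 bytes, the first model peaks at 600 bytes and the second at 300. Real numbers depend on the cache options and on how ranges are coalesced, so this only shows the shape of the trade-off.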


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
