[GitHub] [arrow] bkietz commented on pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option

GitBox Thu, 18 Feb 2021 08:16:16 -0800


bkietz commented on pull request #9482:
URL: https://github.com/apache/arrow/pull/9482#issuecomment-781458352



   The `once_flag` approach sounds viable in principle. However I'm concerned 
that it could cause lots of threads to block for a single pre-buffering, then 
block them all pre-buffering for the next file, etc. For example, a parquet 
fragment with 42 row groups will generate a scan task for each, and if one of 
those tasks is scheduled first in each thread (as in Scanner::ToTable) then I 
think they'd block the whole pool.
   
   Fundamentally, it's part of the contract of ScanTask that two ScanTasks are 
independent, and any shared pre-buffering violates that. I think the correct 
fix for right now is to write another subclass of `ScanTask` which wraps a 
whole pre-buffered file reader rather than individual row groups. This will 
earn us most of the perf advantage of pre-buffering in the common case of 
scanning many files at the cost of losing parallelism when scanning few files.
   
   Eventually we'll use `Future`s to resolve this, probably by replacing
   ```diff
   -  virtual Result<RecordBatchIterator> Execute() = 0;
   +  virtual AsyncGenerator<std::shared_ptr<RecordBatch>> Execute() = 0;
   ```
   this will enable more fluid handling of the wait-for-pre-buffer: the first 
record batch from scan tasks which happen to originate from the same file will 
simply wait a bit longer


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] bkietz commented on pull request #9482: ARROW-11601: [C++][Python][Dataset] expose Parquet pre-buffer option

Reply via email to