[GitHub] [arrow] lidavidm edited a comment on pull request #9620: ARROW-11843: [C++] Provide reentrant Parquet reader

GitBox Thu, 01 Apr 2021 14:59:30 -0700


lidavidm edited a comment on pull request #9620:
URL: https://github.com/apache/arrow/pull/9620#issuecomment-812194926



   What I pushed is still not quite what I want. Ideally, we'd be able to ask 
the read cache for a future that completes once all I/O for a given row group 
has finished, and only then kick off a decoding task. Currently on master, we 
spawn a bunch of tasks that each block waiting for I/O before proceeding 
(wasting threads). In this PR, we instead resort to manually pre-buffering 
each row group separately (which undercuts the effectiveness of 
pre-buffering).
   
   That is, we should be able to say
   
   ```
   reader->PreBuffer(row_groups, columns);
   ...
   // I/O generator
   return reader->WhenBuffered({current_row_group}, {columns});
   
   // Decoding generator
   return cpu_executor_->Transfer(io_generator()).Then(
       []() { return ReadRowGroup(current_row_group); });
   ```
   
   and this will let us coalesce read ranges across row groups while only 
performing work on the CPU pool when it's truly ready.
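
To make the intended control flow concrete, here is a minimal, single-threaded sketch of the "decode only once the I/O is done" pattern. Everything in it is a hypothetical stand-in: `ToyFuture` and `ToyReadCache` are not Arrow classes, `CompleteIo` simulates an I/O completion that a real cache would signal itself, and the toy takes a single row-group index rather than lists of row groups and columns.

```cpp
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Toy stand-in for a future (hypothetical, not arrow::Future).
// Callbacks run when MarkFinished() is called, or immediately if the
// future has already finished. No thread ever blocks waiting on it.
class ToyFuture {
 public:
  void Then(std::function<void()> callback) {
    if (finished_) {
      callback();
    } else {
      callbacks_.push_back(std::move(callback));
    }
  }

  void MarkFinished() {
    finished_ = true;
    for (auto& callback : callbacks_) callback();
    callbacks_.clear();
  }

 private:
  bool finished_ = false;
  std::vector<std::function<void()>> callbacks_;
};

// Toy stand-in for the read cache (hypothetical). It tracks one future
// per pre-buffered row group.
class ToyReadCache {
 public:
  // Register all row groups up front, so that a real implementation
  // could coalesce their read ranges together.
  void PreBuffer(const std::vector<int>& row_groups) {
    for (int row_group : row_groups) pending_[row_group];
  }

  // Future that finishes when the I/O for `row_group` has completed.
  ToyFuture& WhenBuffered(int row_group) { return pending_.at(row_group); }

  // Simulates the I/O layer reporting that a row group is fully read.
  void CompleteIo(int row_group) { pending_.at(row_group).MarkFinished(); }

 private:
  std::map<int, ToyFuture> pending_;
};
```

Here a decoding step would be attached with `cache.WhenBuffered(rg).Then(...)` after a single `PreBuffer` call covering every row group. The callback only fires once `CompleteIo(rg)` is called, so nothing waits on a thread in the meantime. In a real implementation, the callback would presumably be transferred to the CPU executor rather than run inline.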
   
   Also, the range cache will need to be swappable for something that just 
does normal file I/O in the non-S3 case, so that local file scans still 
perform reasonably.
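
One way to picture the swappable cache is as a small interface with two implementations. This is only a sketch under assumptions: `RangeSource`, `DirectSource`, `CachedSource`, and `MakeSource` are hypothetical names, and an in-memory `std::string` stands in for the underlying file.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical interface (not an Arrow class): the reader talks to this
// and does not care whether a cache sits behind it.
class RangeSource {
 public:
  virtual ~RangeSource() = default;
  // Hint that [offset, offset + length) will be read soon.
  virtual void PreBuffer(int64_t offset, int64_t length) = 0;
  virtual std::string Read(int64_t offset, int64_t length) = 0;
};

// Local files: ignore the hint and do a plain read every time.
class DirectSource : public RangeSource {
 public:
  explicit DirectSource(std::string file) : file_(std::move(file)) {}
  void PreBuffer(int64_t, int64_t) override {}
  std::string Read(int64_t offset, int64_t length) override {
    return file_.substr(offset, length);
  }

 private:
  std::string file_;
};

// High-latency stores: fetch eagerly on PreBuffer, then serve from memory.
class CachedSource : public RangeSource {
 public:
  explicit CachedSource(std::string file) : file_(std::move(file)) {}
  void PreBuffer(int64_t offset, int64_t length) override {
    cache_[{offset, length}] = file_.substr(offset, length);
    ++fetches_;
  }
  std::string Read(int64_t offset, int64_t length) override {
    auto it = cache_.find({offset, length});
    if (it != cache_.end()) return it->second;
    ++fetches_;  // cache miss: fall back to a direct fetch
    return file_.substr(offset, length);
  }
  int fetches() const { return fetches_; }

 private:
  std::string file_;
  std::map<std::pair<int64_t, int64_t>, std::string> cache_;
  int fetches_ = 0;
};

// Pick the implementation based on where the file lives.
std::unique_ptr<RangeSource> MakeSource(bool is_remote, std::string file) {
  if (is_remote) return std::make_unique<CachedSource>(std::move(file));
  return std::make_unique<DirectSource>(std::move(file));
}
```

With this shape the scanner could call `PreBuffer` unconditionally, and the local-file implementation would simply treat it as a no-op.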


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
