lidavidm edited a comment on pull request #9620: URL: https://github.com/apache/arrow/pull/9620#issuecomment-812194926
What I pushed is still not quite what I want. Ideally, we'd be able to ask the read cache for a future that finishes when all I/O for the given row group has completed. That way, we can then kick off a decoding task. On master, currently, you just spawn a bunch of tasks that block and wait for I/O and then proceed (wasting threads), and in this PR, we have hijinks to manually pre-buffer each row group separately (wasting the effectiveness of pre-buffering).

That is, we should be able to say

```
reader->PreBuffer(row_groups, columns)
...
// I/O generator
return reader->WhenBuffered({current_row_group}, {columns});
// Decoding generator
return cpu_executor_->Transfer(io_generator()).Then([]() { return ReadRowGroup(current_row_group); });
```

and this will let us coalesce read ranges across row groups while only performing work on the CPU pool when it's truly ready.

Also, the range cache will have to be swappable for something that just does normal file I/O for the non-S3 case so that local file scans are still reasonable.
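
To make the intended control flow concrete, here is a minimal single-threaded sketch using only the standard library. Everything in it is hypothetical and for illustration only: `ToyReadCache`, `ToyCpuExecutor`, `MarkIoComplete`, and `RunDemo` are not Arrow APIs, and the callback-taking `WhenBuffered` merely stands in for a future with a `.Then` continuation. It assumes no concurrency, so there is no locking.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the CPU thread pool: tasks are queued and run
// only when RunAll() is called.
class ToyCpuExecutor {
 public:
  void Spawn(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  std::size_t size() const { return tasks_.size(); }
  void RunAll() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop_front();
      task();
    }
  }

 private:
  std::deque<std::function<void()>> tasks_;
};

// Hypothetical stand-in for the read cache. All row groups are registered up
// front (so a real cache could coalesce their ranges), and callers can ask to
// be notified when the I/O for one particular row group has finished.
class ToyReadCache {
 public:
  using Callback = std::function<void()>;

  void PreBuffer(const std::vector<int>& row_groups) {
    for (int rg : row_groups) states_[rg];  // default-construct the state
  }

  // Runs `on_ready` once all I/O for `row_group` is complete
  // (immediately if it already is).
  void WhenBuffered(int row_group, Callback on_ready) {
    assert(states_.count(row_group) == 1);  // must have been pre-buffered
    State& state = states_.at(row_group);
    if (state.done) {
      on_ready();
    } else {
      state.waiters.push_back(std::move(on_ready));
    }
  }

  // Simulates the I/O layer reporting that a row group is fully read.
  void MarkIoComplete(int row_group) {
    State& state = states_.at(row_group);
    state.done = true;
    std::vector<Callback> waiters;
    waiters.swap(state.waiters);
    for (auto& callback : waiters) callback();
  }

 private:
  struct State {
    bool done = false;
    std::vector<Callback> waiters;
  };
  std::map<int, State> states_;
};

// Decode tasks reach the CPU executor only after their row group's I/O is
// done, so nothing on the CPU side ever blocks waiting for reads.
std::vector<std::string> RunDemo() {
  std::vector<std::string> log;
  ToyReadCache cache;
  ToyCpuExecutor cpu;
  cache.PreBuffer({0, 1});
  for (int rg : {0, 1}) {
    cache.WhenBuffered(rg, [&log, &cpu, rg] {
      cpu.Spawn([&log, rg] { log.push_back("decode " + std::to_string(rg)); });
    });
  }
  log.push_back("queued=" + std::to_string(cpu.size()));
  cache.MarkIoComplete(1);  // row group 1 happens to finish first
  cpu.RunAll();
  cache.MarkIoComplete(0);
  cpu.RunAll();
  return log;
}
```

Calling `RunDemo()` yields the log `queued=0`, `decode 1`, `decode 0`: no decode task is queued until its row group's I/O completes, and row groups are decoded in the order their reads finish rather than the order they were requested.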