lidavidm commented on pull request #9482: URL: https://github.com/apache/arrow/pull/9482#issuecomment-781439694
The issue is that the buffer is shared at the Parquet FileReader level, so if we enable the pre-buffer option at the Arrow RowGroupReader level, executing scan tasks may trample that buffer (each will try to buffer a different row group). We don't notice it in unit tests since pre-buffering is trivial there.

Some possible solutions:

- Manually pre-buffer once at the 'top level', when we create scan tasks. That has some advantages: all the scan tasks would share the same buffer, so there are more opportunities for I/O savings. (If you read the last column of row group 0 and the first column of row group 1, you'd be able to coalesce those into a single I/O operation.) But it means generating the scan tasks would trigger work that happens right away, which is undesirable.

- Give each scan task its own copy of the Parquet reader. This would be OK, but we'd need some refactoring so that we don't re-open the file and re-read the Parquet footer for each scan task. We also wouldn't get those potential I/O coalescing opportunities.

- Generate a synthetic scan task that starts the pre-buffering. I think that isn't safe, though, for users who execute scan tasks in parallel.

- Hand each scan task a shared_ptr<once_flag> that guards the pre-buffering (if present); each scan task would then run the pre-buffering through that once_flag (via call_once) before continuing. I think I like this the best.

@bkietz What do you think?
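To make the shared_ptr<once_flag> idea concrete, here is a minimal sketch using only the C++ standard library. FakeReader, ScanTask, and RunTasksInParallel are hypothetical stand-ins invented for illustration; they are not the Arrow dataset or Parquet reader classes, and the real change may look quite different.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the shared Parquet reader (NOT the Arrow API).
// It only counts how many times pre-buffering was requested.
struct FakeReader {
  std::atomic<int> prebuffer_calls{0};
  void PreBuffer() { ++prebuffer_calls; }
};

// Hypothetical scan task: every task created from the same file shares
// one reader and one once_flag.
struct ScanTask {
  std::shared_ptr<FakeReader> reader;
  std::shared_ptr<std::once_flag> prebuffer_once;  // null: pre-buffering disabled
  int row_group;

  int Execute() {
    if (prebuffer_once) {
      // Only the first task to get here runs PreBuffer(); concurrent callers
      // block until it finishes, and later callers return immediately.
      std::call_once(*prebuffer_once, [this] { reader->PreBuffer(); });
    }
    // A real task would read its row group here; the sketch returns the index.
    return row_group;
  }
};

// Creates num_tasks tasks sharing one flag, runs each on its own thread,
// and returns how many times pre-buffering actually ran.
int RunTasksInParallel(int num_tasks) {
  auto reader = std::make_shared<FakeReader>();
  auto once = std::make_shared<std::once_flag>();

  std::vector<ScanTask> tasks;
  for (int i = 0; i < num_tasks; ++i) tasks.push_back({reader, once, i});

  std::vector<std::thread> threads;
  for (auto& task : tasks) threads.emplace_back([&task] { task.Execute(); });
  for (auto& t : threads) t.join();

  return reader->prebuffer_calls.load();
}
```

In this sketch, creating the tasks does no work up front, and the pre-buffering runs exactly once no matter which task executes first or how many run in parallel. A null flag simply skips the step, mirroring the "if present" part of the proposal.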
