wesm edited a comment on pull request #6744: URL: https://github.com/apache/arrow/pull/6744#issuecomment-621266083
Yes, we should discuss this on the mailing list.

EDIT: we do have a separate thread pool for IO, but it's limited to 8 threads.

Absent a path forward on sane nested parallelism, we're going to continue to see either highly suboptimal performance or scenarios where we can't use parallelism at all because of the risk of deadlocks. In the meantime, I think we need to create an explicit scheduler API (probably higher level and more abstracted than the current ThreadPool API) so that an application can make sense of the IO tasks that are being issued when reading multiple files in parallel. Presumably this would extend to the Datasets API as well.
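To make the deadlock risk concrete, here is a minimal sketch of nested parallelism on a fixed-size pool. It uses Python's `concurrent.futures` as a stand-in for illustration only; it is not Arrow's C++ ThreadPool API, and `read_column`, `read_file`, `run_shared_pool`, and `run_separate_pools` are hypothetical names. A timeout is used so that the starved case returns instead of hanging forever.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def read_column(i):
    # Stand-in for a per-column read; hypothetical, not an Arrow function.
    return i * 2


def read_file(inner_pool, n_columns, timeout):
    # Outer task: fans out per-column tasks to inner_pool and blocks on them.
    futures = [inner_pool.submit(read_column, i) for i in range(n_columns)]
    try:
        return [f.result(timeout=timeout) for f in futures]
    except TimeoutError:
        # Without the timeout this wait would never finish (a deadlock).
        for f in futures:
            f.cancel()
        return None


def run_shared_pool(timeout=1.0):
    # One pool, one worker: the outer task occupies the only worker,
    # so the inner tasks it submits to the same pool can never start.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(read_file, pool, 2, timeout).result()


def run_separate_pools(timeout=1.0):
    # Inner tasks go to a second pool, so the blocked outer worker
    # does not prevent them from running.
    with ThreadPoolExecutor(max_workers=1) as cpu_pool, \
            ThreadPoolExecutor(max_workers=1) as io_pool:
        return cpu_pool.submit(read_file, io_pool, 2, timeout).result()


if __name__ == "__main__":
    print("shared pool:", run_shared_pool())
    print("separate pools:", run_separate_pools())
```

The second variant only moves the problem: a dedicated inner pool avoids the self-starvation, but its fixed size still caps throughput, and the application has no view of which tasks are queued on it. That gap is what an explicit scheduler API would be meant to address.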
