wesm edited a comment on pull request #6744:
URL: https://github.com/apache/arrow/pull/6744#issuecomment-621266083


   Yes, we should discuss on the mailing list. 
   
   EDIT: we do have a separate thread pool for IO, but it's limited to 8 
threads. Eventually, absent a path forward on sane nested parallelism, we're 
going to continue to see either highly suboptimal performance or scenarios 
where we can't use parallelism because of the risk of deadlocks. 
   
   In the meantime, I think we need to create an explicit scheduler API 
(probably higher level / more abstracted than the current ThreadPool API) so 
that an application can make sense of the IO tasks that are being issued when 
reading multiple files in parallel. Presumably this would also extend to the 
Datasets API. 
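
   As a rough illustration of what "making sense of the IO tasks" could mean, 
the following Python sketch wraps a bounded pool in a small scheduler that 
tags each task with the file it belongs to and counts issued and completed 
tasks per file. This is purely an assumption about one possible shape for such 
an API: `IOScheduler`, `fake_read`, and `read_files` are hypothetical names and 
do not correspond to anything in Arrow.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class IOScheduler:
    """Hypothetical scheduler: a bounded pool plus per-source bookkeeping."""

    def __init__(self, max_io_threads: int = 8):
        self._pool = ThreadPoolExecutor(max_workers=max_io_threads)
        self._lock = threading.Lock()
        self.issued = Counter()     # tasks submitted, keyed by source name
        self.completed = Counter()  # tasks finished, keyed by source name

    def submit(self, source: str, fn, *args):
        """Submit fn(*args) as an IO task attributed to `source`."""
        with self._lock:
            self.issued[source] += 1

        def run():
            try:
                return fn(*args)
            finally:
                # Recorded before the future's result becomes visible.
                with self._lock:
                    self.completed[source] += 1

        return self._pool.submit(run)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def fake_read(path: str, offset: int) -> str:
    """Stand-in for a real range read; returns a label instead of bytes."""
    return f"{path}@{offset}"


def read_files(paths, offsets, max_io_threads: int = 2):
    """Issue one task per (path, offset) pair and return results and counts."""
    scheduler = IOScheduler(max_io_threads=max_io_threads)
    futures = [
        scheduler.submit(path, fake_read, path, offset)
        for path in paths
        for offset in offsets
    ]
    results = [f.result() for f in futures]
    scheduler.shutdown()
    return results, dict(scheduler.issued), dict(scheduler.completed)
```

   An application reading several files could then inspect `issued` and 
`completed` to see how IO work is distributed across files, which a bare 
thread pool does not expose.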


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

