Re: [I] Supporting using parallel parquet writer outside of Datafusion query execution [arrow-datafusion]

via GitHub Thu, 14 Mar 2024 05:31:02 -0700


devinjdangelo commented on issue #9493:
URL: 
https://github.com/apache/arrow-datafusion/issues/9493#issuecomment-1997342596


   > As we were discussing this API internally with @tustvold one thing he 
pointed out is that the current code pretty much requires using the same tokio 
threadpool for compute (parquet encoding) and I/O (the object store multi-part 
write). This can cause various problems, depending on what your system is doing.
   
   I think the parallel writer offers a fairly straightforward way to move the 
most CPU-heavy work (serialization) to a separate thread. Here is a draft PR: 
#9605.
   
   I think the main objection to using `spawn_blocking` here is that tokio's 
blocking pool is not sized to the CPU count: it keeps spawning new threads on 
demand, up to a high limit (512 by default). Rayon or a second tokio runtime 
would likely perform better in a case where we need hundreds of parallel tasks 
(e.g. 4 open row groups on a parquet file with 64 columns, i.e. 256 column 
tasks, on a system with only 8 CPU cores).
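
   To make the 4 x 64 example concrete, here is a rough, std-only sketch of 
running those 256 column tasks on a pool capped at 8 worker threads instead of 
one thread per task. It is not the code from the draft PR, and it uses neither 
Rayon nor tokio: `encode_column` and `encode_on_fixed_pool` are hypothetical 
names, and the "encoding" is placeholder arithmetic standing in for parquet 
serialization.

```rust
use std::collections::HashSet;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for CPU-heavy column serialization.
fn encode_column(row_group: usize, column: usize) -> u64 {
    let mut acc = (row_group * 64 + column) as u64;
    for i in 0..1_000u64 {
        acc = acc.wrapping_mul(6364136223846793005).wrapping_add(i);
    }
    acc
}

/// Runs `row_groups * columns` encode tasks on exactly `workers` threads.
/// Returns (tasks completed, distinct worker threads that did any work).
fn encode_on_fixed_pool(row_groups: usize, columns: usize, workers: usize) -> (usize, usize) {
    // Shared task queue: one receiver, guarded by a mutex, pulled by all workers.
    let (task_tx, task_rx) = mpsc::channel::<(usize, usize)>();
    let task_rx = Arc::new(Mutex::new(task_rx));
    // Each finished task reports which thread ran it.
    let (done_tx, done_rx) = mpsc::channel::<(thread::ThreadId, u64)>();

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let task_rx = Arc::clone(&task_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is a temporary, released at the end of this statement.
                let next = task_rx.lock().unwrap().recv();
                match next {
                    Ok((rg, col)) => {
                        let encoded = encode_column(rg, col);
                        done_tx.send((thread::current().id(), encoded)).unwrap();
                    }
                    // Queue closed and drained: the worker exits.
                    Err(_) => break,
                }
            })
        })
        .collect();
    // Drop our copy so `done_rx` ends once every worker has exited.
    drop(done_tx);

    // Enqueue one task per (row group, column) pair, then close the queue.
    for rg in 0..row_groups {
        for col in 0..columns {
            task_tx.send((rg, col)).unwrap();
        }
    }
    drop(task_tx);

    let mut threads_used = HashSet::new();
    let mut completed = 0;
    for (id, _encoded) in done_rx {
        threads_used.insert(id);
        completed += 1;
    }
    for handle in handles {
        handle.join().unwrap();
    }
    (completed, threads_used.len())
}
```

   Calling `encode_on_fixed_pool(4, 64, 8)` completes all 256 tasks while never 
using more than 8 threads, which is the property a Rayon pool or a dedicated 
second runtime would give us. A real implementation would also need to hand 
the encoded bytes back to the async I/O side.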


