Re: [I] Supporting using parallel parquet writer outside of Datafusion query execution [arrow-datafusion]

via GitHub Thu, 14 Mar 2024 03:32:36 -0700


alamb commented on issue #9493:
URL: 
https://github.com/apache/arrow-datafusion/issues/9493#issuecomment-1997126661


   As we were discussing this API internally with @tustvold, one thing he 
pointed out is that the current code pretty much requires using the same tokio 
thread pool for both compute (parquet encoding) and I/O (the object store 
multi-part write). This can cause various problems, depending on what your 
system is doing. 
   
   Some discussion on CPU bound work in tokio: 
https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/
   
   Thus, one thing that would be nice to think about in this API is how we can 
support doing the I/O (e.g. `put_multipart`) on a different thread pool (aka 
tokio Runtime).
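
   To make the shape of that separation concrete, here is a minimal sketch 
using only the standard library: encoding runs on the calling thread and 
finished buffers are handed over a bounded channel to a dedicated I/O thread. 
This is not the DataFusion or `object_store` API. The names `encode` and 
`write_all` are hypothetical, the "upload" only counts bytes, and the plain 
thread stands in for what would be a second tokio `Runtime` in a real system.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for CPU-bound parquet encoding.
fn encode(batch: &[u32]) -> Vec<u8> {
    batch.iter().flat_map(|v| v.to_le_bytes()).collect()
}

// Encodes on the calling ("compute") thread and hands each finished buffer
// to a dedicated I/O thread. Returns the total number of bytes "uploaded".
fn write_all(batches: Vec<Vec<u32>>) -> usize {
    // Bounded channel: the encoder blocks if the I/O side falls behind.
    let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(2);

    // Dedicated I/O thread (assumption: in a tokio-based system this would
    // be a separate Runtime performing the multipart upload).
    let io = thread::spawn(move || {
        let mut uploaded = 0usize;
        for part in rx {
            // A real implementation would upload `part` here.
            uploaded += part.len();
        }
        uploaded
    });

    for batch in &batches {
        tx.send(encode(batch)).expect("I/O thread exited early");
    }
    drop(tx); // close the channel so the I/O thread can finish

    io.join().expect("I/O thread panicked")
}

fn main() {
    let total = write_all(vec![vec![1, 2, 3], vec![4, 5]]);
    println!("uploaded {total} bytes");
}
```

   The bounded channel is the design choice worth noting: it keeps slow I/O 
from letting encoded buffers pile up in memory, while the encoder never runs 
on the thread that performs the writes.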
   
   I believe @tustvold has also been thinking about this in the context of 
https://github.com/apache/arrow-rs/issues/5458 and may even be planning on 
porting some/all of the parallelized parquet writer upstream to parquet (I 
don't fully know the plan yet).
   
   Therefore, as we go through this exercise, we may want to help / join forces 
upstream / take those plans into account as we figure out the right API to 
extract.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
