Re: [PR] Make serialization spawn_blocking in async runtime [arrow-datafusion]

via GitHub Tue, 09 Jan 2024 10:51:14 -0800


alamb commented on PR #8802:
URL: 
https://github.com/apache/arrow-datafusion/pull/8802#issuecomment-1883602861


   >> In general, issuing a blocking call or performing a lot of compute in a 
future without yielding is problematic, as it may prevent the executor from 
driving other futures forward.
   
   I think this part of the tokio docs is confusing and somewhat misleading. 
Specifically, what "a lot of compute" means depends heavily on the application.
   
   In this case, I think the actual serialization is done for one `RecordBatch` 
at a time, and then `await` is called, which potentially yields control. I don't 
think serializing a single `RecordBatch` qualifies as "a lot of compute".
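
To illustrate the point about yielding between batches, here is a minimal sketch using only the Rust standard library. It is not tokio and not DataFusion code: `serialize_batch`, `writer`, `YieldNow`, `run_all`, and `interleaving` are hypothetical names made up for this example, and the toy round-robin executor only stands in for a real runtime. The assumption it demonstrates is that a bounded chunk of CPU work followed by an `.await` point lets other tasks on the same executor make progress.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Returns `Pending` once, then `Ready`: a stand-in for an `.await`
/// point that hands control back to the executor.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing; the toy executor below simply re-polls everything.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for serializing one batch: a bounded chunk of CPU work.
fn serialize_batch(batch: &[u64]) -> u64 {
    batch.iter().fold(0u64, |acc, v| acc.wrapping_mul(31).wrapping_add(*v))
}

/// Does the CPU work for one batch, records it, then yields before the next batch.
async fn writer(name: &'static str, batches: Vec<Vec<u64>>, log: Rc<RefCell<Vec<String>>>) {
    for (i, batch) in batches.iter().enumerate() {
        let _checksum = serialize_batch(batch);
        log.borrow_mut().push(format!("{name}:{i}"));
        YieldNow(false).await;
    }
}

/// Toy single-threaded executor: polls every unfinished task in round-robin order.
fn run_all(mut tasks: Vec<Task>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
    }
}

/// Runs two writers on the same executor and returns the order in which
/// batches were processed.
fn interleaving() -> Vec<String> {
    let log = Rc::new(RefCell::new(Vec::new()));
    let batches = vec![vec![1u64, 2, 3]; 2];
    let a: Task = Box::pin(writer("a", batches.clone(), log.clone()));
    let b: Task = Box::pin(writer("b", batches, log.clone()));
    run_all(vec![a, b]);
    let out = log.borrow().clone();
    out
}
```

Calling `interleaving()` shows the two writers alternating batch by batch rather than one running to completion first, which is the behavior the per-batch `await` is relied on for. If a single call to `serialize_batch` were very expensive, that interleaving would become correspondingly coarse, which is where the "a lot of compute" judgment comes in.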
   
   I wrote up a detailed justification for using tokio for CPU bound tasks in 
the following blog post, which I think is still very relevant 
https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/
   
   > Previously, the serialization process was always asynchronous, even if 
there were no asynchronous calls involved, which was a CPU-bound operation. 
   
   Almost all of a DataFusion plan's execution is CPU bound and not 
asynchronous, yet it is executed heavily using `async`. I don't think this 
is a problem (for reasons explained in the blog post) and I don't see any reason 
we would want to treat writing batches differently.
   
   So all in all, I don't agree with the stated rationale of this PR that we 
should not use the same executor for CPU-bound tasks.
   


