alamb commented on PR #8802: URL: https://github.com/apache/arrow-datafusion/pull/8802#issuecomment-1883602861
>> In general, issuing a blocking call or performing a lot of compute in a future without yielding is problematic, as it may prevent the executor from driving other futures forward.

I think this part of the tokio docs is confusing and somewhat misleading. Specifically, what "a lot of compute" means is very dependent on the application.

In this case, I think the actual serialization is done for a `RecordBatch` and then `await` is called to potentially yield control. I don't think serializing a single `RecordBatch` qualifies as "a lot of compute".

I wrote up a detailed justification for using tokio for CPU-bound tasks in the following blog post, which I think is still very relevant: https://thenewstack.io/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/

> Previously, the serialization process was always asynchronous, even if there were no asynchronous calls involved, which was a CPU-bound operation.

Almost all of a DataFusion plan's execution is CPU bound and not asynchronous, yet it is executed using `async` heavily. I don't think this is a problem (for reasons explained in the blog post) and I don't see any reason we would want to treat writing batches differently.

So all in all, I don't agree with the stated rationale of this PR that we should not use the same executor for CPU-bound tasks.
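
To illustrate the "serialize one batch, then `await`" pattern described above, here is a minimal sketch that uses only the standard library. It is not DataFusion's writer code and it does not use tokio. `serialize_batch`, `write_batches`, `YieldNow`, and the round-robin `run_all` executor are hypothetical stand-ins, and a `Vec<u32>` plays the role of a `RecordBatch`. The point is that a bounded chunk of CPU work followed by an await point lets the executor drive other futures forward between batches.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for an `.await` point (e.g. writing bytes to a sink):
/// returns `Pending` exactly once, handing control back to the executor.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Waker that does nothing; the toy executor below simply re-polls every task.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

type Log = Arc<Mutex<Vec<String>>>;
type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Hypothetical CPU-bound serialization of one "batch" (here just a slice of u32).
fn serialize_batch(batch: &[u32]) -> String {
    batch.iter().map(|v| v.to_string()).collect::<Vec<_>>().join(",")
}

/// Serialize one batch at a time, yielding to the executor after each batch.
async fn write_batches(name: &'static str, batches: Vec<Vec<u32>>, log: Log) {
    for batch in &batches {
        let line = serialize_batch(batch); // bounded chunk of CPU work
        log.lock().unwrap().push(format!("{name}: {line}"));
        YieldNow(false).await; // other tasks may run here
    }
}

/// Box a future so that differently typed tasks can share one queue.
fn task(fut: impl Future<Output = ()> + 'static) -> Task {
    Box::pin(fut)
}

/// Toy single-threaded executor: polls every unfinished task in round-robin order.
fn run_all(mut tasks: Vec<Task>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|t| t.as_mut().poll(&mut cx).is_pending());
    }
}

/// Run two writers on the same executor and return the order in which
/// their batches were serialized.
fn interleaved_log() -> Vec<String> {
    let log: Log = Arc::new(Mutex::new(Vec::new()));
    run_all(vec![
        task(write_batches("a", vec![vec![1, 2], vec![3]], log.clone())),
        task(write_batches("b", vec![vec![4], vec![5, 6]], log.clone())),
    ]);
    let out = log.lock().unwrap().clone();
    out
}

fn main() {
    for line in interleaved_log() {
        println!("{line}");
    }
}
```

In this sketch the two writers alternate batch by batch, because each one yields after serializing a single batch. How large a batch can grow before that yield comes too late depends on the application, which is the point made above.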
