devinjdangelo commented on PR #9605: URL: https://github.com/apache/arrow-datafusion/pull/9605#issuecomment-1998603009
> I agree with this assessment -- basically the threadpool running the DataFusion plan should be doing CPU work and ideally not also IO work

I understand the context here for influxdb, but it would also be interesting to have a deeper discussion on this in the context of DataFusion as a standalone execution engine. I.e., should we be doing anything differently to make sure users of `datafusion-cli` running a query like

```SQL
COPY (select * from 's3://bucket/table') to 's3://bucket/parquet.file'
```

won't run into poll latency issues reading/writing from remote object stores?

Perhaps one of two things is true:

1. Poll latency actually isn't that big of an issue for multipart objectstore writes. The batch sizes are small enough that the time between `.await`s will not negatively impact a streaming multipart write workload.
2. Poll latency does cause unpredictable job failure, and DataFusion should perhaps manage two tokio runtimes itself for IO/CPU or make more consistent use of `spawn_blocking`.

I have tested queries like the above myself (albeit unscientifically) and have not run into any issues. It may be a good idea to test this more thoroughly.
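
To make the idea behind option 2 a bit more concrete, here is a minimal sketch of keeping CPU work and IO on separate executors with a bounded queue between them. This is only an illustration under loud assumptions: it uses plain `std` threads and a `sync_channel` rather than two tokio runtimes or `spawn_blocking`, it is not DataFusion code, and `cpu_work` and `run_pipeline` are hypothetical names. The "upload" is simulated by collecting parts into a `Vec`.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a CPU-heavy operator: reduces a batch to one value.
fn cpu_work(batch: &[u64]) -> u64 {
    batch.iter().sum()
}

/// Runs the CPU work on the calling thread and hands each result to a
/// dedicated IO thread over a bounded channel. Returns the parts the IO
/// thread "uploaded", in the order they were received.
fn run_pipeline(batches: Vec<Vec<u64>>, queue_depth: usize) -> Vec<u64> {
    // Bounded queue: the CPU side blocks on `send` once `queue_depth` parts
    // are waiting, which gives simple backpressure.
    let (tx, rx) = sync_channel::<u64>(queue_depth);

    // IO side (assumption: a real system would issue multipart upload calls
    // here). It only waits on the channel, so long CPU-bound steps on the
    // other thread never delay it from servicing a part that is ready.
    let io = thread::spawn(move || {
        let mut uploaded = Vec::new();
        for part in rx {
            uploaded.push(part);
        }
        uploaded
    });

    // CPU side: compute each part, then pass it across the boundary.
    for batch in &batches {
        let part = cpu_work(batch);
        tx.send(part).expect("IO thread exited early");
    }

    // Dropping the sender closes the channel so the IO loop terminates.
    drop(tx);
    io.join().expect("IO thread panicked")
}
```

Calling `run_pipeline` with a few small batches and a `queue_depth` of 2 returns one summed part per batch. Whether a split like this is actually needed is the open question above; the sketch only shows the shape of the separation, not evidence that poll latency is a problem in practice.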
