Re: [PR] Move parallel parquet serialization to blocking threads [arrow-datafusion]

via GitHub Thu, 14 Mar 2024 15:44:51 -0700


devinjdangelo commented on PR #9605:
URL: 
https://github.com/apache/arrow-datafusion/pull/9605#issuecomment-1998603009


   > I agree with this assessment -- basically the threadpool running the 
DataFusion plan should be doing CPU work and ideally not also IO work
   
   I understand the context here for influxdb, but it would also be interesting 
to have a deeper discussion on this in the context of DataFusion as a 
standalone execution engine. I.e. should we be doing anything differently to 
make sure users of `datafusion-cli` running a query like
   
   ```SQL
   COPY (select * from 's3://bucket/table') to 's3://bucket/parquet.file'
   ```
   
   won't run into poll latency issues when reading from or writing to remote 
object stores? Perhaps one of two things is true:
   
   1. Poll latency actually isn't that big of an issue for multipart 
object store writes. The batch sizes are small enough that the time between 
`.await`s will not negatively impact a streaming multipart write workload.
   2. Poll latency does cause unpredictable job failures, and DataFusion should 
perhaps manage two tokio runtimes itself (one for IO, one for CPU) or make more 
consistent use of `spawn_blocking`.
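   
   To make option 2 concrete, here is a rough sketch of the underlying idea: 
hand CPU-heavy work to a separate thread so that the thread driving IO stays 
free to be polled. It deliberately uses only the Rust standard library as a 
stand-in for tokio's `spawn_blocking`, and it is not how DataFusion does this. 
The names `offload` and `encode_batch` are hypothetical and exist only for 
illustration.
   
   ```rust
   use std::sync::mpsc::{self, Receiver, TryRecvError};
   use std::thread;

   /// Hypothetical helper: run a CPU-heavy closure on its own OS thread and
   /// return a channel on which the result will arrive.
   fn offload<T, F>(work: F) -> Receiver<T>
   where
       T: Send + 'static,
       F: FnOnce() -> T + Send + 'static,
   {
       let (tx, rx) = mpsc::channel();
       thread::spawn(move || {
           // Ignore the send error if the receiver was dropped.
           let _ = tx.send(work());
       });
       rx
   }

   /// Hypothetical stand-in for CPU-bound serialization of one batch.
   fn encode_batch(rows: &[u64]) -> Vec<u8> {
       rows.iter().flat_map(|r| r.to_le_bytes()).collect()
   }

   fn main() {
       let batch: Vec<u64> = (0..1_000).collect();
       let pending = offload(move || encode_batch(&batch));

       // The "IO" thread is never blocked by the encoding work; it keeps
       // looping (a real runtime would be polling other tasks here).
       let mut ticks = 0u32;
       let encoded = loop {
           match pending.try_recv() {
               Ok(bytes) => break bytes,
               Err(TryRecvError::Empty) => {
                   ticks += 1;
                   thread::yield_now();
               }
               Err(TryRecvError::Disconnected) => {
                   panic!("worker thread exited without sending a result")
               }
           }
       };
       println!("encoded {} bytes after {} ticks", encoded.len(), ticks);
   }
   ```
   
   In a tokio-based engine the same shape would presumably be expressed with 
`spawn_blocking` or a second runtime rather than raw threads. Whether the extra 
hop pays off depends on how long the CPU work runs between `.await` points, 
which is exactly the open question above.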
   
   I have tested queries like the above myself (albeit unscientifically) and 
have not run into any issues. It may be a good idea to test this more thoroughly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
