Rachelint commented on issue #11451:
URL: https://github.com/apache/datafusion/issues/11451#issuecomment-2230941223

   > > But I still don't quite understand about `I think this is probably not a 
big issue if you are setting the partition parallelism to the number` mentioned 
above... Mind explaining it in more detail?
   > 
   > Honestly that was just a guess on my part so it may very well be that even 
with one partition per core you would see the same issue.
   > 
   > But I was thinking that with one partition per core the IO and CPU work 
are pipelined reasonably well. The table scan will do some IO, then decode the 
data and process through the rest of the pipeline. By the time the CPU work is 
required, there is no more IO in flight to block. With any repartitions though 
that would get complicated so not really sure.
   
   Thanks, got it. So, as I understand it, the executors are underutilized because the IO is still in flight when the CPU work is required?
   
   So we try to spawn more IO tasks to improve IO parallelism. However, the scheduling is done by `tokio`, so maybe spawning more tasks can't actually achieve the IO parallelism we are aiming for?
   
   For example, assume we spawn:
   - 8 CPU tasks
   - 16 IO tasks, to improve IO parallelism

   The ideal scenario is:
   - the 16 IO tasks are scheduled and executed; they fetch 16 batches, start fetching the next 16, and yield
   - the 8 CPU tasks are scheduled, consume the current 16 batches, and yield
   - the 16 IO tasks are scheduled again; the next 16 batches are ready by now, so they fetch them, start fetching the next 16, and yield
   - the 8 CPU tasks are scheduled, consume, and yield
   ...
   
   However, tokio may not schedule the tasks the way we expect:
   - 8 of the IO tasks (group 1) are scheduled and executed; they fetch 8 batches, start fetching the next 8, and yield
   - the 8 CPU tasks are scheduled, consume the current 8 batches, and yield
   - the 8 IO tasks in group 1 are scheduled again, but their batches are not ready
   - the other 8 IO tasks (group 2) are scheduled, but their batches are not ready yet either
   ...
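
To put rough numbers on the two scenarios above, here is a back-of-the-envelope sketch. It is only a toy steady-state model (Little's law), not a model of how `tokio` actually schedules anything. The function names and the latency figures (16 ms per IO request, 8 ms of CPU work per batch) are assumptions made up for illustration; only the 8 CPU tasks and the 16 vs. 8 IO tasks come from the example above.

```rust
/// Batches per millisecond the IO side can deliver in steady state,
/// given how many requests are effectively in flight (Little's law).
fn io_rate(io_in_flight: u32, io_latency_ms: f64) -> f64 {
    io_in_flight as f64 / io_latency_ms
}

/// Batches per millisecond the CPU tasks could consume if they never starved.
fn cpu_rate(cpu_tasks: u32, cpu_ms_per_batch: f64) -> f64 {
    cpu_tasks as f64 / cpu_ms_per_batch
}

/// Fraction of the CPU capacity that is actually used.
/// The pipeline can only run as fast as its slower stage.
fn cpu_utilization(
    io_in_flight: u32,
    io_latency_ms: f64,
    cpu_tasks: u32,
    cpu_ms_per_batch: f64,
) -> f64 {
    let io = io_rate(io_in_flight, io_latency_ms);
    let cpu = cpu_rate(cpu_tasks, cpu_ms_per_batch);
    io.min(cpu) / cpu
}

fn main() {
    // Hypothetical latencies: 16 ms per IO request, 8 ms of CPU work per batch.
    // Ideal case: all 16 IO requests are really in flight at once.
    let ideal = cpu_utilization(16, 16.0, 8, 8.0);
    // Degraded case: only 8 requests are effectively in flight.
    let degraded = cpu_utilization(8, 16.0, 8, 8.0);
    println!("ideal: {}, degraded: {}", ideal, degraded);
}
```

Under these assumed numbers, the CPU tasks are kept fully busy when all 16 requests overlap, and are idle half of the time when only 8 effectively overlap. Whether `tokio` really ends up in the second case is exactly the open question.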
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

