alamb commented on PR #8802:
URL: 
https://github.com/apache/arrow-datafusion/pull/8802#issuecomment-1883653026

   > @alamb, would the current approach work well in a setting with a small 
number of threads and large batch sizes? The scenario that came up today when 
we discussed this internally was running a deep query (i.e. with a lot of 
operators) in such a setting. We were wondering if the pull mechanism would get 
disrupted if serializations would take too long in such an environment.
   
   This likely depends on what "large batch size" means and what type of 
disruption you are talking about. 
   
   My internal mental model is that if you give the query plan `N` threads, it 
should be able to keep all `N` threads busy all of the time. Each output 
serializer will pull the next batch to serialize (the same thread will likely 
compute the input it is about to serialize, if each of the input futures is 
`ready`) and then serialize that batch. The CPU should be kept busy the entire 
time producing input or serializing output, so there are no stalls, and the 
query won't suffer context-switching overhead (from having more threads than 
CPUs).
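   To make the mental model concrete, here is a minimal sketch using plain synchronous iterators as a stand-in for DataFusion's pull-based streams (the operator names here are illustrative, not DataFusion APIs): each pull on the outermost "serializer" drives one batch through the whole operator chain on the calling thread, so that thread stays busy computing input or serializing output rather than idling.

```rust
// Hypothetical pipeline: "scan" -> "filter" -> "serialize", where each
// stage is lazy and work only happens when the consumer pulls.
fn run_pipeline() -> Vec<String> {
    let scan = (0..6).map(|i| vec![i; 2]);         // "scan" producing batches
    let filtered = scan.filter(|b| b[0] % 2 == 0); // an intermediate operator
    // Each pull on this iterator drags one batch through every operator
    // above it, all on the pulling thread.
    filtered.map(|b| format!("{b:?}")).collect()   // the "serializer"
}

fn main() {
    for out in run_pipeline() {
        println!("{out}");
    }
}
```

In the real async setting the same idea holds per thread: as long as the input futures are `ready`, the thread that polls the output stream also does the upstream compute, so no extra threads (and no context switches) are needed.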
   
   The larger the `target_batch_size` is:
   1. the greater the intermediate memory required (as each intermediate batch 
takes more memory)
   2. the greater the latency may be between producing subsequent output 
batches
   3. the lower the per-row CPU overhead (as the many things done "per batch" 
get amortized over more rows)
   
   So I think the application needs to figure out how it wants to trade these 
factors off.
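   A back-of-the-envelope sketch of that trade-off (all numbers below are illustrative assumptions, not DataFusion measurements): growing the batch size grows per-batch memory linearly while shrinking the share of fixed per-batch overhead paid by each row.

```rust
// Memory held by one intermediate batch (illustrative row width).
fn per_batch_memory(rows: usize, bytes_per_row: usize) -> usize {
    rows * bytes_per_row
}

// Fixed per-batch overhead (dispatch, metadata, etc.) amortized per row.
fn overhead_per_row(fixed_overhead_ns: f64, rows: usize) -> f64 {
    fixed_overhead_ns / rows as f64
}

fn main() {
    // Assumed: 64 bytes/row, 10,000 ns of fixed work per batch.
    for rows in [1024, 8192, 65536] {
        println!(
            "batch_size={rows:6}: memory={} bytes, overhead={:.3} ns/row",
            per_batch_memory(rows, 64),
            overhead_per_row(10_000.0, rows)
        );
    }
}
```

The right point on that curve depends on how much the application values peak memory and first-batch latency versus raw throughput.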
   
   > Maybe the answer is no (and we should test/measure to see what happens), 
but I'd love to hear your thought process on such a scenario.
   
   I am always a fan of measuring / testing.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
