alamb commented on PR #8802: URL: https://github.com/apache/arrow-datafusion/pull/8802#issuecomment-1883653026
> @alamb, would the current approach work well in a setting with a small number of threads and large batch sizes? The scenario that came up today when we discussed this internally was running a deep query (i.e. with a lot of operators) in such a setting. We were wondering if the pull mechanism would get disrupted if serializations took too long in such an environment.

This likely depends on what "large batch size" means and what type of disruption you are talking about.

My internal mental model is that if you give the query plan `N` threads, it should be able to keep all `N` threads busy all of the time. Each output serializer pulls the next batch to serialize (the same thread will likely compute the input batch if each of the input futures is `ready`) and then serializes it. The CPU should be busy the entire time either producing input or serializing output, so there should be no stalls, and the query won't suffer context-switching overhead (from having more threads than CPUs). A minimal sketch of this pull loop is at the end of this comment.

The larger the `target_batch_size` is:
1. the greater the intermediate memory required (each intermediate batch holds more rows)
2. the greater the latency may be between generating subsequent batches of output
3. the lower the CPU overhead (many things are done "per batch", so their cost is amortized over more rows)

So I think the application needs to figure out how it wants to trade off these properties (see the configuration sketch below).

> Maybe the answer is no (and we should test/measure to see what happens), but I'd love to hear your thought process on such a scenario.

I am always a fan of measuring / testing.
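For illustration, here is a minimal sketch of the pull loop described above, assuming the `datafusion`, `futures`, and `tokio` crates; `serialize_batch` is a hypothetical stand-in for whatever output writer the application actually uses (Arrow IPC, CSV, etc.):

```rust
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::*;
use futures::StreamExt;

// Hypothetical serializer: a real application would hand the batch to an
// Arrow IPC or CSV writer; this one only counts rows to stay self-contained.
fn serialize_batch(batch: &RecordBatch) -> usize {
    batch.num_rows()
}

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;

    // Each call to `next()` pulls the next batch from the plan; when the
    // input future is `ready`, the same thread that computed the batch
    // typically goes on to serialize it, so the CPU stays busy.
    let mut stream = ctx.sql("SELECT * FROM t").await?.execute_stream().await?;
    let mut rows = 0;
    while let Some(batch) = stream.next().await {
        rows += serialize_batch(&batch?);
    }
    println!("serialized {rows} rows");
    Ok(())
}
```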

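And a minimal sketch of how an application might pick its point in the trade-off space above, assuming `SessionConfig::with_batch_size`; the value `8192` is illustrative, not a recommendation:

```rust
use datafusion::prelude::*;

fn main() {
    // Larger batches amortize per-batch overhead over more rows; smaller
    // batches reduce intermediate memory use and inter-batch latency.
    let config = SessionConfig::new().with_batch_size(8192);
    let _ctx = SessionContext::new_with_config(config);
}
```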