pepijnve commented on issue #16490: URL: https://github.com/apache/datafusion/issues/16490#issuecomment-2992976261
I was just scanning through https://www.microsoft.com/en-us/research/wp-content/uploads/2024/12/Extensible-Query-Optimizers-in-Practice.pdf. Seems my intuition is not too far off > Finally, we note that while the Exchange operator can simplify the implementation of parallelism, it can still be beneficial to design and implement multi-threaded operators. For example, with Exchange operator, each thread would build its own hash table in the Hash Join operator, which could result in hash tables of variable sizes due to data skew. In contrast, if the Hash Join operator is designed to be multi-threaded, it is possible to build a single hash table that supports concurrent operations, which avoids the above issue with data skew. I think that's kind of what the morsel paper suggests as well. Avoid processing imbalance by partitioning more dynamically and compensate for data skew by coalescing in a multi-threaded build side. Easy to write that sentence, actually building it is a different matter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org