pepijnve commented on issue #16490:
URL: https://github.com/apache/datafusion/issues/16490#issuecomment-2992976261

   I was just scanning through 
https://www.microsoft.com/en-us/research/wp-content/uploads/2024/12/Extensible-Query-Optimizers-in-Practice.pdf.
 Seems my intuition is not too far off
   
   > Finally, we note that while the Exchange operator can simplify
   the implementation of parallelism, it can still be beneficial to design
   and implement multi-threaded operators. For example, with Exchange
   operator, each thread would build its own hash table in the Hash Join
   operator, which could result in hash tables of variable sizes due to
   data skew. In contrast, if the Hash Join operator is designed to be
   multi-threaded, it is possible to build a single hash table that supports
   concurrent operations, which avoids the above issue with data skew.
   
   I think that's kind of what the morsel paper suggests as well. Avoid 
processing imbalance by partitioning more dynamically and compensate for data 
skew by coalescing in a multi-threaded build side. Easy to write that sentence, 
actually building it is a different matter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to