Re: [I] [DISCUSS] Distinguish between "performance" and "efficiency" cores [datafusion]

via GitHub Fri, 20 Jun 2025 14:26:11 -0700


pepijnve commented on issue #16490:
URL: https://github.com/apache/datafusion/issues/16490#issuecomment-2992976261

I was just scanning through
https://www.microsoft.com/en-us/research/wp-content/uploads/2024/12/Extensible-Query-Optimizers-in-Practice.pdf
and it seems my intuition is not too far off:

> Finally, we note that while the Exchange operator can simplify
the implementation of parallelism, it can still be beneficial to design
and implement multi-threaded operators. For example, with Exchange
operator, each thread would build its own hash table in the Hash Join
operator, which could result in hash tables of variable sizes due to
data skew. In contrast, if the Hash Join operator is designed to be
multi-threaded, it is possible to build a single hash table that supports
concurrent operations, which avoids the above issue with data skew.

I think that's kind of what the morsel paper suggests as well: avoid
processing imbalance by partitioning more dynamically, and compensate for data
skew by coalescing in a multi-threaded build side. That sentence is easy to
write; actually building it is a different matter.
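
To make the contrast concrete, here is a rough, std-only Rust sketch of the two build strategies. It is not DataFusion code: the function names, the modulo partitioning, and the single `Mutex<HashMap>` standing in for a real concurrent hash table are all assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Exchange-style build (hypothetical helper): rows are hash-partitioned by
/// key up front and each partition gets its own private hash table.
/// Returns the number of rows that landed in each table.
fn partitioned_build_sizes(keys: &[u64], partitions: usize) -> Vec<usize> {
    let mut tables: Vec<HashMap<u64, Vec<usize>>> = vec![HashMap::new(); partitions];
    for (row, &key) in keys.iter().enumerate() {
        // Toy partitioning function (assumption): key modulo partition count.
        tables[(key as usize) % partitions]
            .entry(key)
            .or_default()
            .push(row);
    }
    tables
        .iter()
        .map(|t| t.values().map(Vec::len).sum::<usize>())
        .collect()
}

/// Morsel-style build (hypothetical helper): workers claim fixed-size row
/// ranges from a shared cursor and insert into one shared table.
/// Returns (rows processed per worker, total rows in the shared table).
fn shared_build(keys: &[u64], workers: usize, morsel: usize) -> (Vec<usize>, usize) {
    assert!(morsel > 0, "morsel size must be positive");
    let cursor = AtomicUsize::new(0);
    // A mutex-guarded map is a stand-in for a proper concurrent hash table.
    let table: Mutex<HashMap<u64, Vec<usize>>> = Mutex::new(HashMap::new());

    let per_worker: Vec<usize> = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers {
            handles.push(s.spawn(|| {
                let mut claimed = 0usize;
                loop {
                    // Claim the next morsel; work is handed out by position,
                    // not by key, so a hot key cannot pin one worker.
                    let start = cursor.fetch_add(morsel, Ordering::Relaxed);
                    if start >= keys.len() {
                        break;
                    }
                    let end = (start + morsel).min(keys.len());
                    let mut guard = table.lock().unwrap();
                    for row in start..end {
                        guard.entry(keys[row]).or_default().push(row);
                    }
                    claimed += end - start;
                }
                claimed
            }));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    let total: usize = table
        .into_inner()
        .unwrap()
        .values()
        .map(Vec::len)
        .sum();
    (per_worker, total)
}
```

With 1000 rows where 900 share key 0 and the rest use keys 0, 10, 20, ..., 990, `partitioned_build_sizes(&keys, 4)` puts 950 rows in one table and leaves two tables empty. `shared_build` always ends with a single 1000-row table, and how the rows split across workers depends on scheduling rather than on key distribution. The coarse lock here serializes inserts, which is exactly the part that is hard to do well in a real implementation.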

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
