cloud-fan commented on pull request #35047: URL: https://github.com/apache/spark/pull/35047#issuecomment-1007198645
Since this introduces overhead (rebuilding the hash relation, more memory), I think we need to carefully make sure the benefit is larger than the overhead. Asking users to tune a config is really not a good way to roll out this optimization. Some random ideas:

1. Look at the NDV of the join keys. If there are very few duplicated keys, skip this optimization.
2. Do it at the executor side, in a dynamic way. If we detect that a key is consistently looked up and has many values, compact the values of this key and put them into a new fast map (see the sketch below). We can even set an upper bound on the number of keys to optimize, to avoid taking too much memory. We don't need to rewrite the entire hash relation.
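For illustration, a minimal Scala sketch of what idea 2 could look like, assuming a simplified `K => Iterator[V]` lookup interface. The class name and all parameters (`hotThreshold`, `minValuesToCompact`, `maxCompactedKeys`) are hypothetical and not Spark's actual `HashedRelation` API:

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical executor-side wrapper: detects hot keys dynamically and
// compacts their values into a bounded fast map, without rewriting the
// entire hash relation.
class HotKeyCompactingLookup[K, V: ClassTag](
    underlying: K => Iterator[V],   // the existing hash relation lookup
    hotThreshold: Int = 100,        // probes before a key counts as "hot"
    minValuesToCompact: Int = 8,    // only compact keys with many values
    maxCompactedKeys: Int = 1000) { // cap on extra memory for the fast map

  private val probeCounts = mutable.Map.empty[K, Int]
  private val fastMap = mutable.Map.empty[K, Array[V]]

  def lookup(key: K): Iterator[V] = fastMap.get(key) match {
    // Hot key already compacted: serve its values from the fast map.
    case Some(values) => values.iterator
    case None =>
      val count = probeCounts.getOrElse(key, 0) + 1
      probeCounts.update(key, count)
      if (count >= hotThreshold && fastMap.size < maxCompactedKeys) {
        // Materialize the matched values once; reuse them for later probes,
        // but only if the key actually has many values.
        val values = underlying(key).toArray
        if (values.length >= minValuesToCompact) fastMap.update(key, values)
        values.iterator
      } else {
        underlying(key)
      }
  }
}
```

The `maxCompactedKeys` bound is what keeps the extra memory predictable: once the fast map is full, further hot keys just fall through to the regular lookup path.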
