cloud-fan commented on pull request #35047: URL: https://github.com/apache/spark/pull/35047#issuecomment-1007198645
Since this introduces overhead (rebuilding the hash relation, more memory), I think we need to carefully make sure the benefit is larger than the overhead. Asking users to tune a config is really not a good way to roll out this optimization. Some random ideas:

1. Look at the NDV of the join keys. If there are very few duplicated keys, skip this optimization.
2. Do it at the executor side, in a dynamic way. If we detect that a key is consistently looked up and has many values, compact the values of this key and put them into a new fast map (see the sketch below). We can even set an upper bound on the number of keys to optimize, to avoid taking too much memory. We don't need to rewrite the entire hash relation.
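For illustration, a minimal Scala sketch of what idea 2 could look like, assuming a simplified `K => Iterator[V]` lookup interface. The class name and all parameters (`hotThreshold`, `minValuesToCompact`, `maxCompactedKeys`) are hypothetical and not Spark's actual `HashedRelation` API:

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

// Hypothetical executor-side wrapper: detects hot keys dynamically and
// compacts their values into a bounded fast map, without rewriting the
// entire hash relation.
class HotKeyCompactingLookup[K, V: ClassTag](
    underlying: K => Iterator[V],   // the existing hash relation lookup
    hotThreshold: Int = 100,        // probes before a key counts as "hot"
    minValuesToCompact: Int = 8,    // only compact keys with many values
    maxCompactedKeys: Int = 1000) { // cap on extra memory for the fast map

  private val probeCounts = mutable.Map.empty[K, Int]
  private val fastMap = mutable.Map.empty[K, Array[V]]

  def lookup(key: K): Iterator[V] = fastMap.get(key) match {
    // Hot key already compacted: serve its values from the fast map.
    case Some(values) => values.iterator
    case None =>
      val count = probeCounts.getOrElse(key, 0) + 1
      probeCounts.update(key, count)
      if (count >= hotThreshold && fastMap.size < maxCompactedKeys) {
        // Materialize the matched values once; reuse them for later probes,
        // but only if the key actually has many values.
        val values = underlying(key).toArray
        if (values.length >= minValuesToCompact) fastMap.update(key, values)
        values.iterator
      } else {
        underlying(key)
      }
  }
}
```

The `maxCompactedKeys` bound is what keeps the extra memory predictable: once the fast map is full, further hot keys just fall through to the regular lookup path.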
