[GitHub] [spark] agrawaldevesh commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column

GitBox Tue, 04 Aug 2020 23:11:47 -0700


agrawaldevesh commented on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-669001099



   @leanken ... this was a GREAT GREAT attempt and I certainly learned a ton 
from it :-P. I am curious whether you profiled it while running Q16 and have 
a sense of where the low-hanging fruit might be? 
   
   We can also consider the hybrid approach we discussed, where we double the 
memory and keep the original HashedRelation for steps 1 and 2 of the paper but 
use the inverted indices only for step 3. That might help with the regression 
that the inverted index causes in the single-key case. 
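
   To make the hybrid idea concrete, here is a rough Python sketch of null-aware anti join (NOT IN) semantics over multi-column keys. It is not the Spark implementation and does not reproduce the paper's numbered steps; the split into an exact-match set plus a separate structure for null-containing rows is my assumption about what "keep the HashedRelation, add the inverted indices" could look like. All names (`Row`, `definitely_differs`, `null_aware_anti_join`) are hypothetical, and the plain linear scans stand in for the inverted-index lookups.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[Optional[int], ...]  # None models SQL NULL


def definitely_differs(left: Row, right: Row) -> bool:
    """True when some column is non-null on both sides and unequal,
    i.e. the row comparison is FALSE (not UNKNOWN) in three-valued logic."""
    return any(
        l is not None and r is not None and l != r
        for l, r in zip(left, right)
    )


def null_aware_anti_join(left_rows: Iterable[Row], right_rows: Iterable[Row]) -> List[Row]:
    right = list(right_rows)
    # Assumption: this set plays the role of the original hash lookup.
    exact = {r for r in right if None not in r}
    # Assumption: these rows are what an inverted index would serve;
    # here they are simply scanned.
    with_nulls = [r for r in right if None in r]

    kept = []
    for l in left_rows:
        if None not in l:
            # Fast path: a null-free left row can only equal a null-free right row.
            if l in exact:
                continue
            # Null-containing right rows may still compare as UNKNOWN.
            if all(definitely_differs(l, r) for r in with_nulls):
                kept.append(l)
        else:
            # Slow path: a left row with NULLs must be checked against every right row.
            if all(definitely_differs(l, r) for r in right):
                kept.append(l)
    return kept
```

   With a single key column and no NULLs on either side, only the `exact` lookup is exercised, which is why keeping the original hash structure next to the indices might avoid the single-key regression, at the cost of the extra memory.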
   
   In any case, I am totally with @cloud-fan that supporting the single-key 
case for shuffled hash join is more important, as I also noted in my previous comment:
   
   > As a diversion, I wonder if it makes sense instead to support the single 
key case but for distributed scenario (shuffle hash join and like) if this 
multi-key stuff is really hard. I think the single-key distributed case would 
be more common.


