agrawaldevesh commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-669001099
@leanken ... this was a GREAT GREAT attempt and I certainly learned a ton from it :-P. I am curious whether you profiled it while running Q16 and have a sense of where the low-hanging fruit might be?

We can also consider the hybrid approach we discussed, where we double the memory and keep the original HashedRelation for steps 1 and 2 of the paper but use the inverted indices only for step 3. That might help with the regression that the inverted index causes in the single-key case.

In any case, I am totally with @cloud-fan that supporting single-key shuffled hash join is more important, as I also noted in my previous comment:

> As a diversion, I wonder if it makes sense instead to support the single key case but for distributed scenario (shuffle hash join and like) if this multi-key stuff is really hard. I think the single-key distributed case would be more common.
