leanken commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-668873447
@agrawaldevesh I have already pushed the InvertedIndex version of the POC and gathered some test results on TPCH 1TB Q16. It does cause a performance regression in the single-column case; for the multi-column case, the perf data is as expected. I talked with @cloud-fan offline, and he suggests the regression is mainly because the HashMap is inefficient compared to `LongToUnsafeRowMap` and `BytesToBytesMap`, which are more CPU-cache friendly.

For the following reasons, I think I need to temporarily close this PR and maybe revisit it at some point in the future:

* As we discussed yesterday, multi-column is not used that frequently in production; from what I have observed, maybe in less than 0.1% of cases.
* I can't guarantee that supporting multi-column will cause no regression for single column, since there is no such `UnsafeRoaringBitmap` now, and if I were to implement one, it might be too much for the reviewers to review for both correctness and performance.
* Single-column support for `ShuffledHashJoinExec` should be more important.

But we did still come up with a neat algorithm for a complicated issue, and that should count for something. Let's just consider this PR a discussion and memo; some day, when it is time to support multi-column, the conversation and test results in this PR might be helpful.

@agrawaldevesh @viirya @maropu sorry for wasting your time. As @cloud-fan suggested, I will move on to supporting single-column `ShuffledHashJoinExec`.

## hashedRelation impl

### SingleColumn

E2E time: 33.6s
BHJ Stage time: 5.9m

## InvertedIndex impl

### SingleColumn

E2E time: 40.5s
BHJ Stage time: 16.3m

### TwoColumn

E2E time: 59.2s
BHJ Stage time: 36.2m

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
