leanken commented on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-668873447


   @agrawaldevesh, I have already pushed the InvertedIndex version of the POC
   and gathered some test results on TPCH 1TB Q16.
   It does cause a performance regression in the single-column case (numbers
   below); in the multi-column case, the perf data is as expected.
   I talked with @cloud-fan offline; he suggests that the regression is mainly
   because a HashMap is less efficient than LongToUnsafeRowMap and
   BytesToBytesMap, which are more CPU-cache friendly.
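
To make the cache-friendliness point concrete: a `java.util.HashMap<Long, V>` boxes its keys and allocates a node object per entry, so a lookup chases pointers across the heap, while a specialized long-keyed map can keep keys and values in flat primitive arrays and probe them linearly. The sketch below is a hypothetical, simplified open-addressing map. `FlatLongMap` and its methods are made-up names for illustration, not Spark's actual implementation, and it has no resizing or off-heap storage.

```java
/** Hypothetical illustration of a flat, primitive-array hash map; not Spark code. */
final class FlatLongMap {
    private final long[] keys;      // keys stored contiguously, no boxing
    private final int[] values;     // values stored contiguously
    private final boolean[] used;   // slot occupancy flags
    private final int mask;         // capacity - 1, valid because capacity is a power of two
    private int size;

    /** capacity must be a power of two; the map holds at most capacity - 1 keys. */
    FlatLongMap(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new long[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
        mask = capacity - 1;
    }

    /** Multiplicative hash; the high bits are folded into a slot index. */
    private int slot(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h >>> 32) & mask;
    }

    /** Inserts or overwrites; linear probing walks adjacent array slots. */
    void put(long key, int value) {
        int i = slot(key);
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask;
        }
        if (!used[i]) {
            // Always keep one slot empty so that probing terminates.
            if (size + 1 == keys.length) {
                throw new IllegalStateException("table full; this sketch does not resize");
            }
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    /** Returns the stored value, or defaultValue if the key is absent. */
    int getOrDefault(long key, int defaultValue) {
        int i = slot(key);
        while (used[i]) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return defaultValue;
    }
}
```

A lookup here touches a few neighbouring array slots, whereas the equivalent `HashMap<Long, Integer>` lookup dereferences a bucket array, a node, and a boxed key. That difference in memory access pattern is the kind of effect the suggestion above points at; how much of the measured regression it explains is not established here.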
   
   For the following reasons, I think I need to shelve this PR for now and
   perhaps revisit it at some point in the future.
   
   * As we discussed yesterday, multi-column is not used that frequently in
   production; from what I have observed, maybe less than 0.1%.
   * I can't guarantee that supporting multi-column will cause no regression
   for single column. There is no `UnsafeRoaringBitmap` today, and if I were
   to implement one, it might be too much for the reviewers to review for
   both correctness and performance.
   * Single-column support for ShuffledHashJoinExec should be more important.
   
   Still, we did come up with a neat algorithm for a complicated issue, and
   that should count for something. Let's treat this PR as a discussion and
   memo; when the time comes to support multi-column, the conversation and
   test results in this PR might be helpful.
   @agrawaldevesh @viirya @maropu sorry for taking up your time. As @cloud-fan
   suggested, I will move on to supporting single-column ShuffledHashJoinExec.
   
   ## hashedRelation impl
   
   ### SingleColumn
   E2E time: 33.6s
   BHJ Stage time: 5.9m
   
   ## InvertedIndex impl
   
   ### SingleColumn
   E2E time: 40.5s
   BHJ Stage time: 16.3m
   
   ### TwoColumn
   E2E time: 59.2s
   BHJ Stage time: 36.2m

