leanken commented on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667701334


   > > Correct me if I am wrong here, but I don't see any special handling for 
null keys on the left (probe) side. The changes to HashedRelation only account 
for the right-side handling.
   > > That is, you seem to have implemented step 2 in section 6.2 of the NAAJ 
paper, but I don't see where step 3 is implemented. That one seems trickier 
since it needs wildcard matches.
   > 
   > In fact, the main reason I expand the data on the build side is to avoid 
handling null values on the probe side.
   > 
   > 
![image](https://user-images.githubusercontent.com/17242071/89128138-0bb36080-d526-11ea-825a-ac7a3a838a18.png)
   > 
   > Let's say there is a record
   > 
   > (1, null, 3) on the probe side. If there is a (1, 2, 3) on the build side, 
it counts as a `MATCH` in the comparison. Basically, if I want to avoid the 
O(M*N) cost of a nested-loop lookup over the build side, I have to expand 
(1, 2, 3) into every null-padded combination, listed below.
   > 
   > The original key expands to 2^3 - 1 = 7 keys, and with that duplication we 
can use the probe-side record (1, null, 3) for a direct hash lookup. I don't 
know if I made it clear for you @agrawaldevesh, it is a bit hard for me to 
explain in English. ^_^
   > 
   > (1, 2, 3)
   > (null, 2, 3)
   > (1, null, 3)
   > (1, 2, null)
   > (null, null, 3)
   > (null, 2, null)
   > (1, null, null)
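
The expansion listed above can be sketched in a few lines. This is only an illustration in Python, not the code in the PR: the function name `expand_with_null_padding` is hypothetical, `None` stands in for SQL null, and the build-side key is assumed to contain no nulls of its own.

```python
from itertools import product


def expand_with_null_padding(key):
    """Return every null-padded variant of `key`, excluding the all-null one.

    Assumption: `key` is a tuple of non-null build-side join key values.
    A column that is already None is left as None, so no duplicates appear.
    """
    # Each column either keeps its value or is replaced by None.
    choices = [(value, None) if value is not None else (None,) for value in key]
    return [
        combo
        for combo in product(*choices)
        # Drop the variant in which every column has been padded with None.
        if any(value is not None for value in combo)
    ]


expanded = expand_with_null_padding((1, 2, 3))
for variant in expanded:
    print(variant)
```

For a key with n non-null columns this yields 2^n - 1 variants, so the build side grows exponentially in the number of join key columns.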
   
   Basically, if any probe-side key column is null, the null columns are 
ignored and the remaining non-null columns are matched against the 
corresponding columns on the build side. Since streamed-side rows can have 
nulls in any combination of positions, I cannot pre-build for one particular 
combination, so I have to expand to every null-padded combination.
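
To make the lookup concrete, here is a rough Python sketch of the idea, not the actual HashedRelation change. The names `build_expanded_key_set` and `probe_matches` are hypothetical, `None` stands in for SQL null, and build-side keys are assumed to be non-null (null keys on the build side are a separate case that this sketch does not cover).

```python
from itertools import product


def expand_with_null_padding(key):
    """All null-padded variants of a build-side key, excluding the all-null one."""
    choices = [(value, None) if value is not None else (None,) for value in key]
    return [c for c in product(*choices) if any(v is not None for v in c)]


def build_expanded_key_set(build_keys):
    """Build the hash set once, paying the 2^n - 1 expansion per build key."""
    expanded = set()
    for key in build_keys:
        expanded.update(expand_with_null_padding(key))
    return expanded


def probe_matches(probe_key, expanded_keys):
    """Return True if the probe row counts as a match for the anti join.

    A probe key with nulls in some positions is looked up as-is: the build
    side already contains a variant with None in exactly those positions.
    """
    if all(value is None for value in probe_key):
        # Assumption: an all-null probe key matches any non-empty build side,
        # since every column comparison is unknown.
        return len(expanded_keys) > 0
    return probe_key in expanded_keys


expanded_keys = build_expanded_key_set([(1, 2, 3)])
print(probe_matches((1, None, 3), expanded_keys))
print(probe_matches((1, None, 4), expanded_keys))
```

Each probe row costs one hash lookup instead of a scan over the build side, which is the trade made for the larger build-side table.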




