[GitHub] [spark] agrawaldevesh commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column

GitBox Sun, 02 Aug 2020 10:56:13 -0700


agrawaldevesh commented on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667705270



   > let's say there is a record
   > 
   > (1, null, 3) in probe side, if there is a (1,2,3) in build side, it's 
counted as `MATCH` in comparison. basically if i want to avoid 0(M*N) which is 
loop look up in build side, i will have to expand (1,2,3) with all combination 
null padding new records like
   > 
   > Original key expand to 2^3 -1 = 7X keys, and we can use probe side record 
(1, null, 3) to just directly hash loop up with such data duplication. I don't 
know if I make it clean for you @agrawaldevesh , it is a bit hard for me to 
explain in english. ^_^
   > 
   > (1, 2,3 )
   > (null, 2, 3)
   > (1, null, 3)
   > (1, 2, null)
   > (null, null, 3)
   > (null, 2, null)
   > (1, null, null)
   
   Let's consider for both steps 2 and 3 of section 6.2 in the NAAJ separately:
   
   - Step 2: Say there is a right (build) side row (1, null, 3). It should be 
counted as a match against a row on the left side (1, 2, 3). What makes this 
tricky is that say say you have a build row (1, 5, 3), then (1, 5, 3) should 
NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1, 
null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of 
these subcases ?
   
   - Step 3: Consider a build row (1, 5, null), it should match the left row 
(1, null, 3). In addition, it should not match the build row (1, 5, 7). How do 
you handle these subcases ?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] agrawaldevesh commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column

Reply via email to