[GitHub] [spark] leanken edited a comment on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column

GitBox Sun, 02 Aug 2020 20:22:22 -0700


leanken edited a comment on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667781762



   > > Step 2: Say there is a right (build) side row (1, null, 3). It should be 
counted as a match against a row on the left side (1, 2, 3). What makes this 
tricky is that say say you have a build row (1, 5, 3), then (1, 5, 3) should 
NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1, 
null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of 
these subcases ?
   > > Step 3: Consider a build row (1, 5, null), it should match the left row 
(1, null, 3). In addition, it should not match the build row (1, 5, 7). How do 
you handle these subcases ?
   > 
   > Above, when I mean "match" -- I mean that the left side would match the 
build row and WON'T be returned. Whereas with non match I mean that the left 
side would not match the build side and thus WILL be returned. We have 
different meanings for the words 'match' and 'not-match'. So please read my 
'match' == 'NAAJ should not return the left row', and conversely for non-match.
   > 
   > I would really really really encourage you to:
   > 
   > * Please reread the paper section 6.2 in its entirety many times and 
understand the above cases. I had to read it many times myself. It is very 
tricky as you pointed out.
   > * Add them as test cases comparing them with the original BNLJ 
implementation, both the negative and positive cases.
   > 
   > This is really tricky and I don't think the current implementation you 
have of expanding the hash table with a simple lookup on the stream side would 
suffice. I will also try to play around with your PR locally and run them as 
tests to convince myself. I hope I am wrong ;-).
   
   Yes, I do understand of the Paper 6.2. Basically the paper describe the 
algorithm in the perspective of StreamedSide. But the expansion state the 
perspective of BuildSide. Let's just do revert inferencing of the following 
case.
   
   if buildSide exist a row (1,2,3), what data in StreamedSide will evaluated 
as TRUE OR UNKNOWN and dropped.
   it should be 
   (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, 
null, null) and of course (1,2,3)
   right?
   
   Only in above combination, streamedSide row will be dropped besides 
non-all-null case, right?
   Once you find a exact same record in HashedRelation include null columns, 
you dropped.
   
   ```
   if (lookupKey.allNull()) {
                 false
               } else {
                 // Anti Join: Drop the row on the streamed side if it is a 
match on the build
                 hashed.get(lookupKey) == null
               }
   ```
   
   I suppose this solution is working because it's passing all the not in cases 
in SQLQueryTestSuite.  
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] leanken edited a comment on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column

Reply via email to