agrawaldevesh commented on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667773518


   > Step 2: Say there is a right (build) side row (1, null, 3). It should be 
counted as a match against a row on the left side (1, 2, 3). What makes this 
tricky is that say say you have a build row (1, 5, 3), then (1, 5, 3) should 
NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1, 
null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of 
these subcases ?
   Step 3: Consider a build row (1, 5, null), it should match the left row (1, 
null, 3). In addition, it should not match the build row (1, 5, 7). How do you 
handle these subcases ?
   
   Above, when I mean "match" -- I mean that the left side would match the 
build row and WON'T be returned. Whereas with non match I mean that the left 
side would not match the build side and thus WILL be returned. We have 
different meanings for the words 'match' and 'not-match'. So please read my 
'match' == 'NAAJ should not return the left row', and conversely for non-match.
   
   I would really really really encourage you to:
   - Please reread the paper section 6.2 in its entirety many times and 
understand the above cases. I had to read it many times myself. It is very 
tricky as you pointed out.
   - Add them as test cases comparing them with the original BNLJ 
implementation, both the negative and positive cases. 
   
   This is really tricky and I don't think the current implementation you have 
of expanding the hash table with a simple lookup on the stream side would 
suffice. I will also try to play around with your PR locally and run them as 
tests to convince myself. I hope I am wrong ;-).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to