leanken commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667781762
> > Step 2: Say there is a right (build) side row (1, null, 3). It should be counted as a match against a row on the left side (1, 2, 3). What makes this tricky is that say say you have a build row (1, 5, 3), then (1, 5, 3) should NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1, null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of these subcases ? > > Step 3: Consider a build row (1, 5, null), it should match the left row (1, null, 3). In addition, it should not match the build row (1, 5, 7). How do you handle these subcases ? > > Above, when I mean "match" -- I mean that the left side would match the build row and WON'T be returned. Whereas with non match I mean that the left side would not match the build side and thus WILL be returned. We have different meanings for the words 'match' and 'not-match'. So please read my 'match' == 'NAAJ should not return the left row', and conversely for non-match. > > I would really really really encourage you to: > > * Please reread the paper section 6.2 in its entirety many times and understand the above cases. I had to read it many times myself. It is very tricky as you pointed out. > * Add them as test cases comparing them with the original BNLJ implementation, both the negative and positive cases. > > This is really tricky and I don't think the current implementation you have of expanding the hash table with a simple lookup on the stream side would suffice. I will also try to play around with your PR locally and run them as tests to convince myself. I hope I am wrong ;-). Yes, I do understand of the Paper 6.2. Basically the paper describe the algorithm in the perspective of StreamedSide. But the expansion state the perspective of BuildSide. Let's just do revert inferencing of the following case. if buildSide exist a row (1,2,3), what data in StreamedSide will evaluated as TRUE OR UNKNOWN and dropped. it should be (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) and of course (1,2,3) right? Only in above combination, streamedSide row will be dropped besides non-all-null case, right? I suppose this solution is working because it's passing all the not in cases in SQLQueryTestSuite. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
