leanken edited a comment on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667781762
> > Step 2: Say there is a right (build) side row (1, null, 3). It should be
counted as a match against a row on the left side (1, 2, 3). What makes this
tricky is that say say you have a build row (1, 5, 3), then (1, 5, 3) should
NOT match the probe row (1, 2, 3). But if you explode (1, 5, 3) into a (1,
null, 3) then it might incorrectly match (1, 2, 3). How do you handle both of
these subcases ?
> > Step 3: Consider a build row (1, 5, null), it should match the left row
(1, null, 3). In addition, it should not match the build row (1, 5, 7). How do
you handle these subcases ?
>
> Above, when I mean "match" -- I mean that the left side would match the
build row and WON'T be returned. Whereas with non match I mean that the left
side would not match the build side and thus WILL be returned. We have
different meanings for the words 'match' and 'not-match'. So please read my
'match' == 'NAAJ should not return the left row', and conversely for non-match.
>
> I would really really really encourage you to:
>
> * Please reread the paper section 6.2 in its entirety many times and
understand the above cases. I had to read it many times myself. It is very
tricky as you pointed out.
> * Add them as test cases comparing them with the original BNLJ
implementation, both the negative and positive cases.
>
> This is really tricky and I don't think the current implementation you
have of expanding the hash table with a simple lookup on the stream side would
suffice. I will also try to play around with your PR locally and run them as
tests to convince myself. I hope I am wrong ;-).
Yes, I do understand of the Paper 6.2. Basically the paper describe the
algorithm in the perspective of StreamedSide. But the expansion state the
perspective of BuildSide. Let's just do revert inferencing of the following
case.
if buildSide exist a row (1,2,3), what data in StreamedSide will evaluated
as TRUE OR UNKNOWN and dropped.
it should be
(null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1,
null, null) and of course (1,2,3)
right?
Only in above combination, streamedSide row will be dropped besides
non-all-null case, right?
Once you find a exact same record in HashedRelation include null columns,
you dropped.
```
if (lookupKey.allNull()) {
false
} else {
// Anti Join: Drop the row on the streamed side if it is a
match on the build
hashed.get(lookupKey) == null
}
```
I suppose this solution is working because it's passing all the not in cases
in SQLQueryTestSuite.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]