agrawaldevesh commented on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667822810


   > Yes, I do understand Section 6.2 of the paper. Basically, the paper describes the algorithm from the perspective of the streamed side, while the expansion here is stated from the perspective of the build side. Let's just do the reverse inference for the following case.
   > 
   > If the build side contains a row (1, 2, 3), which streamed-side rows will evaluate to TRUE or UNKNOWN and be dropped?
   > It should be
   > (null, 2, 3), (1, null, 3), (1, 2, null), (null, null, 3), (null, 2, null), (1, null, null), and of course (1, 2, 3).
   > Right?
   
   First let me make sure I understand the current approach in this PR: we take a build row and add all possible null-padded combinations of it to the hash table. There is then almost no change to the streaming side: we look for an exact match in the hash table. The only other tiny change on the streaming side is going from anyNull to allNull, which becomes pertinent with multiple keys. Is that right?
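
   If so, here is a minimal sketch of that expansion as I understand it (the helper name is mine, not the PR's code):

   ```scala
   // Hypothetical sketch of the build-side expansion as I understand this PR
   // (not the PR's actual code). For an n-key build row, every combination of
   // keys replaced by null is inserted: up to 2^n entries per row.
   def nullPaddedVariants(row: Seq[Any]): Seq[Seq[Any]] =
     (0 until (1 << row.length)).map { mask =>
       row.zipWithIndex.map { case (v, i) =>
         if ((mask & (1 << i)) != 0) null else v
       }
     }

   // nullPaddedVariants(Seq(1, 2, 3)) yields (1,2,3), (null,2,3), (1,null,3),
   // (1,2,null), (null,null,3), (null,2,null), (1,null,null), and
   // (null,null,null); whether that last one belongs is the question below.
   ```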
   
   So with this, consider what happens when there is a build row of (1, null, 3). We expect the probe row (1, 2, 3) to NOT be returned (or to be 'matched' in my parlance). Let's see what would happen with this PR: we will expand (1, null, 3) into the following rows in the HT: (1, null, 3), (null, null, 3), (1, null, null), (null, null, null). (Btw, should (null, null, null) even be added here?) Unfortunately the probe row (1, 2, 3) finds no exact match and is RETURNED. Similarly, the left-side row (1, 3, null) should also match (and thus not be RETURNED), but this PR would return it.
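
   To spell out why (1, 2, 3) must be dropped, here is the comparison under SQL's three-valued logic (a toy evaluator of mine, not Spark code):

   ```scala
   // Toy three-valued logic evaluator (mine, not Spark code) showing why
   // probe row (1, 2, 3) must be dropped when the build side has (1, null, 3).
   sealed trait Tri
   case object True extends Tri
   case object False extends Tri
   case object Unknown extends Tri

   // SQL equality: any null operand yields UNKNOWN.
   def eq3(a: Any, b: Any): Tri =
     if (a == null || b == null) Unknown else if (a == b) True else False

   // SQL AND: FALSE dominates, then UNKNOWN.
   def and3(x: Tri, y: Tri): Tri = (x, y) match {
     case (False, _) | (_, False) => False
     case (True, True)            => True
     case _                       => Unknown
   }

   // (1, 2, 3) = (1, null, 3) is TRUE AND UNKNOWN AND TRUE = UNKNOWN, so the
   // NOT IN predicate is UNKNOWN and the probe row must not be returned.
   val verdict = Seq[(Any, Any)]((1, 1), (2, null), (3, 3))
     .map { case (p, b) => eq3(p, b) }
     .reduce(and3) // Unknown
   ```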
   
   (Please consider adding these as tests; you will see that the BNLJ implementation passes them, while this PR would fail them.)
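
   Something along these lines (table and column names made up; I am assuming the usual inline-VALUES style) would capture both counterexamples:

   ```scala
   // Hypothetical test sketch for the counterexamples above. Both left rows
   // compare UNKNOWN against the build row (1, NULL, 3), so a correct NOT IN
   // anti-join must return no rows at all.
   val df = spark.sql("""
     SELECT * FROM VALUES (1, 2, 3), (1, 3, CAST(NULL AS INT)) AS l(a, b, c)
     WHERE (a, b, c) NOT IN
       (SELECT * FROM VALUES (1, CAST(NULL AS INT), 3) AS r(x, y, z))
   """)
   assert(df.collect().isEmpty) // BNLJ passes this; the expanded HT would not
   ```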
   
   The PR currently will pass the negative tests I mentioned above: build row (1, 5, 3) will not match probe row (1, 2, 3), nor will build row (1, 5, null) match probe row (1, 5, 7). That is, the PR will correctly NOT return the probe rows (1, 2, 3) or (1, 5, 7).
   
   As for why the PR may be passing SubquerySuite: I am not sure how strong SubquerySuite's coverage of multi-key NOT IN is. The BNLJ approach didn't have a special algorithm for multiple keys, and thus that case may not have needed as much attention.
   
   My guess is that we cannot just do a simple HT probe on the stream side; we have to do something different. The original NAAJ paper calls for creating multiple HTs with different key combinations (depending on what is null or not).
   
   > @agrawaldevesh I finally understand the complexity of multi-column support, thanks to your reminding me again and again; I feel sorry about my naivety. Do you think it is still worth carrying on with multi-column support? I sincerely ask for your suggestion.
   
   Don't feel sorry. I admire you for being bold and persevering with this! This is how engineering proceeds: we learn when we hit brick walls. I have actually tried (unsuccessfully!) to implement this optimization before using a variety of these naive approaches, but shied away from implementing the multiple-index approach because it seemed like too much work.
   
   You have come very far with this, and I would totally encourage you to think about how to support this for the common cases of just 2 and 3 keys; a sketch of the 2-key case follows below. I think creating additional indices "on-demand" (as mentioned in step 3) when we encounter a particular row type on the probe side is probably not going to work (it would lead to unpredictable memory behavior).
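
   Concretely, here is a rough sketch (all names are mine, not Spark internals, and this is just my reading of the paper's multiple-index idea) of what up-front indices for the 2-key case might look like:

   ```scala
   // Sketch of the multiple-index idea for the 2-key case, under my reading
   // of the paper. Besides the exact hash table, keep one value set per key
   // plus null flags, all built once up front (avoiding on-demand indices).
   final case class TwoKeyIndices(
       exact: Set[(Any, Any)], // build rows stored with their keys as-is
       key1Values: Set[Any],   // non-null first-key values on the build side
       key2Values: Set[Any],   // non-null second-key values on the build side
       hasNullKey1: Boolean,   // some build row has a null first key
       hasNullKey2: Boolean,
       nonEmpty: Boolean)

   // True means the probe row is dropped (NOT IN is FALSE or UNKNOWN for it).
   def matched(k1: Any, k2: Any, idx: TwoKeyIndices): Boolean = (k1, k2) match {
     // All-null probe key: UNKNOWN against any build row at all.
     case (null, null) => idx.nonEmpty
     // One null key: dropped iff some build row agrees on, or is null in, the
     // non-null position; the per-key index answers this without a scan.
     case (a, null) => idx.hasNullKey1 || idx.key1Values.contains(a)
     case (null, b) => idx.hasNullKey2 || idx.key2Values.contains(b)
     // No null keys: exact lookup plus the null-padded variant lookups.
     case (a, b) =>
       idx.exact((a, b)) || idx.exact((null, b)) ||
       idx.exact((a, null)) || idx.exact((null, null))
   }
   ```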
   
   As for whether we should do this or not... it totally depends on the approach. If it's not too much work to support the special case of 2 keys (let's even forget 3 keys), we can consider it. It's a tradeoff: customers can always rewrite their query from NOT IN to NOT EXISTS if they are not happy with the BNLJ performance.
   
   How about this approach (mirroring steps 1, 2, and 3 in the paper): step 1 is the exact match. For step 2, instead of exploding the HT (i.e., inserting the null combinations there), just search for the different null variants as mentioned in step 2 of the paper; that is, do the different null-padded lookups instead of storing the padded rows. If, however, we encounter a left-side row with a null key, then we would be forced to do the BNLJ-style "scanning" of the HT to search for all matching combinations (i.e., if we decide not to build additional indices). This approach would still be more efficient than plain BNLJ, since the NLJ penalty is only paid for left rows having a null key.
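
   In code, that probe-side flow might look roughly like this (helper names are hypothetical; the expansion helper is the same one sketched earlier, now applied to the probe key):

   ```scala
   // Same 2^n expansion as in the first sketch, now applied to the probe key.
   def nullPaddedVariants(row: Seq[Any]): Seq[Seq[Any]] =
     (0 until (1 << row.length)).map { m =>
       row.zipWithIndex.map { case (v, i) => if ((m & (1 << i)) != 0) null else v }
     }

   // True means the probe row is dropped from the anti-join output. `ht`
   // holds the build rows exactly as they are; nothing padded is stored.
   def matched(probe: Seq[Any], ht: Set[Seq[Any]],
               buildRows: Seq[Seq[Any]]): Boolean =
     if (probe.forall(_ == null)) {
       buildRows.nonEmpty // all-null key: UNKNOWN against any build row
     } else if (probe.exists(_ == null)) {
       // Step 3 fallback: a null key on the left forces a BNLJ-style scan
       // (this is the only place the NLJ penalty is paid).
       buildRows.exists { build =>
         probe.zip(build).forall { case (p, b) => p == null || b == null || p == b }
       }
     } else {
       // Steps 1 and 2: the exact lookup plus the null-padded lookups,
       // instead of storing the padded rows in the hash table.
       nullPaddedVariants(probe).exists(ht.contains)
     }
   ```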
   
   As a diversion, I wonder if it makes sense instead to support the single-key case but for the distributed scenario (shuffle hash join and the like), if this multi-key stuff is really hard. I think the single-key distributed case would be more common.
   
   Thanks for working on this!
   
   
   

