agrawaldevesh commented on pull request #29304: URL: https://github.com/apache/spark/pull/29304#issuecomment-667822810
> Yes, I do understand §6.2 of the paper. Basically, the paper describes the algorithm from the perspective of the streamed side, but the expansion states it from the perspective of the build side. Let's just do reverse inference on the following case: if the build side has a row (1, 2, 3), what data on the streamed side will be evaluated as TRUE or UNKNOWN and dropped? It should be (null, 2, 3), (1, null, 3), (1, 2, null), (null, null, 3), (null, 2, null), (1, null, null), and of course (1, 2, 3). Right?

First let me make sure I understand the current approach in this PR: we take a build row and add its possible null-padded combinations to the hash table. Then there is almost no change to the streaming side: we look for an exact match in the hash table. The only other tiny change on the streaming side is from `anyNull` to `allNull`, which becomes pertinent with multiple keys. Is that right?

With that in mind, consider what happens when there is a build row of (1, null, 3). We expect the probe row (1, 2, 3) to NOT be returned (or to be 'matched', in my parlance). Let's see what would happen with this PR: we will expand (1, null, 3) into the following rows in the HT: (1, null, 3), (null, null, 3), (1, null, null), (null, null, null). (Btw, should (null, null, null) even be added here?) Unfortunately the row (1, 2, 3) does not match and IS RETURNED. Similarly, the left-side row (1, 3, null) should also match and not be RETURNED, but this PR would return it. (Please consider adding these as tests, and you will see that the BNLJ implementation passes them but this PR would fail them.)

The PR currently will pass the negative tests I mentioned above: build row (1, 5, 3) will not match probe row (1, 2, 3), and neither will build row (1, 5, null) match probe row (1, 5, 7). That is, the PR will correctly NOT return the probe rows (1, 2, 3) and (1, 5, 7).

As for why the PR may be passing SubquerySuite: I am not sure how strong SubquerySuite's coverage of multi-key NOT IN is.
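To make the failure mode concrete, here is a small illustrative sketch (plain Python, all names invented) of the build-side expansion plus exact-match probe as I understand it from this PR; it is not the PR's actual code:

```python
from itertools import combinations

def null_variants(row):
    """The row plus every copy with some non-null positions nulled out."""
    idxs = [i for i, v in enumerate(row) if v is not None]
    out = {row}
    for r in range(1, len(idxs) + 1):
        for subset in combinations(idxs, r):
            out.add(tuple(None if i in subset else v
                          for i, v in enumerate(row)))
    return out

def build_expanded_ht(build_rows):
    # The PR's approach (as I read it): insert each build row together
    # with all of its null-padded combinations.
    ht = set()
    for row in build_rows:
        ht |= null_variants(row)
    return ht

def probe_returns(ht, probe_row):
    # NOT IN semantics: return the probe row only when no build row
    # evaluates TRUE or UNKNOWN against it. The PR's probe is an
    # exact-match lookup.
    return probe_row not in ht

ht = build_expanded_ht([(1, None, 3)])
# (1, 2, 3) compares UNKNOWN against (1, null, 3), so it must NOT be
# returned -- but the exact-match probe misses it:
print(probe_returns(ht, (1, 2, 3)))     # True: wrongly returned
print(probe_returns(ht, (1, 3, None)))  # True: also wrongly returned
```

The expansion only covers build rows whose null pattern is a superset of the build row's own nulls, so a probe row with a concrete value where the build row has null can never hit the table.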
The BNLJ approach didn't have a special algorithm for multiple keys, and thus it may not have needed as much attention. My guess is that we cannot just do simple HT probing on the stream side; we have to do something different. The original NAAJ paper calls for creating multiple HTs with different key combinations (depending on what's null or not).

> @agrawaldevesh I finally understand the complexity of multi-column support, thanks to your repeated reminders; I feel sorry about my naivety. Do you think it is still worth carrying on to support multiple columns? I sincerely ask for your suggestion.

Don't feel sorry. I admire you for being bold and persevering with this!! This is how engineering proceeds: we learn when we hit brick walls. I have actually tried (unsuccessfully!) to implement this optimization before using a variety of these naive approaches, but shied away from implementing the multiple-index approach because it seemed like too much work. You have come very far with this, and I would totally encourage you to think about how to support the common cases of just 2 and 3 keys.

I think creating additional indices "on demand" (as mentioned in step 3) when we encounter a particular row type on the probe side is probably not going to work (it would lead to unpredictable memory behavior). As for whether we should do this at all: it totally depends on the approach. If it's not too much work to support the special case of 2 keys (let's even forget three keys), we can consider it. It's a tradeoff: customers can always rewrite their query from NOT IN to NOT EXISTS if they are not happy with the BNLJ performance.

How about this approach (mirroring steps 1, 2, and 3 in the paper): Step 1 is for the exact match. For step 2, instead of exploding the HT (i.e. inserting the null combinations there), just search for the different null variants as mentioned in step 2 of the paper; i.e. do the different null-aware padding lookups instead of storing those combinations.
Whereas if you encounter a left-side row with a null key, then we would be forced to do the BNLJ-style "scanning" of the HT to search for all matching combinations (i.e. if we decide not to build additional indices). This approach would still be more efficient than plain BNLJ, since the NLJ penalty is paid only for left rows having a null key.

As a diversion, I wonder if it makes sense instead to support the single-key case for the distributed scenario (shuffle hash join and the like), if this multi-key stuff is really hard. I think the single-key distributed case would be more common. Thanks for working on this!!
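The hybrid probe suggested above could be sketched as follows (illustrative Python, all names invented; a sketch of the idea, not a proposed implementation): the build side is stored unexpanded, non-null probe rows do hash lookups with the row itself and with each null-padded variant (steps 1 and 2), and probe rows carrying a null key fall back to the BNLJ-style scan.

```python
from itertools import combinations

def null_variants(row):
    """The row plus every copy with some non-null positions nulled out."""
    idxs = [i for i, v in enumerate(row) if v is not None]
    out = {row}
    for r in range(1, len(idxs) + 1):
        for subset in combinations(idxs, r):
            out.add(tuple(None if i in subset else v
                          for i, v in enumerate(row)))
    return out

def matches_true_or_unknown(build_row, probe_row):
    # Three-valued row comparison: FALSE only when some column holds two
    # non-null, unequal values; otherwise TRUE or UNKNOWN.
    return not any(b is not None and p is not None and b != p
                   for b, p in zip(build_row, probe_row))

def should_drop(build_rows, ht, probe_row):
    """True iff some build row evaluates TRUE or UNKNOWN vs probe_row."""
    if any(v is None for v in probe_row):
        # Null key on the probe side: BNLJ-style scan; the NLJ penalty
        # is paid only for these rows.
        return any(matches_true_or_unknown(b, probe_row)
                   for b in build_rows)
    # Steps 1 and 2: for a fully non-null probe row, a build row matches
    # TRUE or UNKNOWN iff it equals the row or one of its null variants.
    return any(v in ht for v in null_variants(probe_row))

build_rows = [(1, None, 3)]
ht = set(build_rows)  # no expansion of the build side
print(should_drop(build_rows, ht, (1, 2, 3)))     # True: dropped, as BNLJ would
print(should_drop(build_rows, ht, (1, 3, None)))  # True: via the scan fallback
print(should_drop(build_rows, ht, (1, 5, 7)))     # False: correctly returned
```

The variant lookups are complete for non-null probe rows because a build row compares TRUE or UNKNOWN against such a row exactly when it is one of the row's null-padded variants; the lookup count is 2^k per probe row for k keys, which stays small for the 2- and 3-key cases discussed above.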
