HeartSaVioR opened a new pull request #23634: [SPARK-26187][SQL] Left/right outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634

## What changes were proposed in this pull request?

This patch fixes an edge case of left/right outer join, described below:

- row L1 and row R1 are joined at batch A
- row R1 is evicted at batch B due to the join and watermark condition, whereas row L1 is not evicted
- row L1 is evicted at batch C

When determining which outer rows to match with null, Spark relies on an assumption documented in a comment in the codebase:

```
Checking whether the current row matches a key in the right side state, and that key
has any value which satisfies the filter function when joined. If it doesn't,
we know we can join with null, since there was never (including this batch) a match
within the watermark period. If it does, there must have been a match at some point, so
we know we can't join with null.
```

As the edge case above shows, this assumption is not correct: by the time L1 is evicted at batch C, R1 is no longer in the right side state, so L1 is joined with null even though it already matched R1 at batch A. Since there is no assumption to optimize with that is free of edge cases, we have to track whether each row has been matched before, and join it with a null row only when it has not.

To track this, the patch adds a new state to the streaming join state manager that marks whether each row has been matched with others. This information is used when evicting rows that are candidates for joining with null rows.

This approach avoids dealing with state versions, but end users whose query ran before this patch need to discard their existing state to get correct results for streaming left/right outer joins.

## How was this patch tested?

Added a UT which fails on current Spark and passes with this patch. Existing streaming join UTs also pass.
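To make the L1/R1 edge case concrete, here is a minimal in-memory sketch in Python. It is not Spark's code or API: `BufferedRow`, `SideState`, `add_and_join`, and `run_scenario` are hypothetical names invented for illustration, the join condition is reduced to key equality, and the per-side eviction thresholds are simply passed in by hand. It contrasts the old check (look for the key in the other side's state at eviction time) with the per-row matched flag this patch describes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class BufferedRow:
    value: Any
    event_time: int
    matched: bool = False  # assumption: one extra flag stored per buffered row


@dataclass
class SideState:
    """Toy per-side join state (key -> buffered rows); not Spark's state manager."""
    rows: Dict[Any, List[BufferedRow]] = field(default_factory=dict)

    def add(self, key: Any, value: Any, event_time: int, matched: bool) -> None:
        self.rows.setdefault(key, []).append(BufferedRow(value, event_time, matched))

    def mark_matched(self, key: Any) -> List[BufferedRow]:
        """Flag every buffered row for `key` as matched and return those rows."""
        hits = self.rows.get(key, [])
        for row in hits:
            row.matched = True
        return hits

    def evict(self, threshold: int) -> List[Tuple[Any, BufferedRow]]:
        """Remove and return rows whose event time is below `threshold`."""
        evicted = []
        for key in list(self.rows):
            keep = []
            for row in self.rows[key]:
                if row.event_time < threshold:
                    evicted.append((key, row))
                else:
                    keep.append(row)
            if keep:
                self.rows[key] = keep
            else:
                del self.rows[key]
        return evicted


def add_and_join(this_side: SideState, other_side: SideState,
                 key: Any, value: Any, event_time: int) -> List[Any]:
    """Buffer a new row, join it by key, and flag both ends of every match."""
    hits = other_side.mark_matched(key)
    this_side.add(key, value, event_time, matched=bool(hits))
    return [other.value for other in hits]


def run_scenario(use_matched_flag: bool) -> List[Tuple[Any, Any]]:
    """Left outer join over the three batches from the description."""
    left, right = SideState(), SideState()
    output: List[Tuple[Any, Any]] = []

    # Batch A: R1 and L1 arrive with the same key and are joined.
    add_and_join(right, left, "k", "R1", 10)
    output += [("L1", r) for r in add_and_join(left, right, "k", "L1", 10)]

    # Batch B: only the right side's threshold has passed R1; L1 stays buffered.
    right.evict(threshold=11)
    left.evict(threshold=5)

    # Batch C: L1 is evicted and is a candidate for a null-joined outer row.
    for key, row in left.evict(threshold=11):
        if use_matched_flag:
            never_matched = not row.matched       # tracked per row
        else:
            never_matched = key not in right.rows  # old check against current state
        if never_matched:
            output.append((row.value, None))
    return output
```

Under these simplifications, `run_scenario(False)` returns both `("L1", "R1")` and a spurious `("L1", None)`, while `run_scenario(True)` returns only the real match, because the flag set at batch A survives the eviction of R1.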
