JoshRosen commented on issue #27309: [SPARK-30598][SQL] Detect equijoins better URL: https://github.com/apache/spark/pull/27309#issuecomment-576983916 To double-check / restate my own understanding of the example query: ## For inner joins: If we had an inner join of the form ```sql SELECT * FROM t1 JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 ``` then that's effectively a cross-join (because we'll push each `c2 = 2` filter beneath its corresponding side of the join). There's not a whole lot of room to improve our join execution strategy in this case (this PR's changes wouldn't help us). ## For full outer joins: Given ```sql SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 ``` then the optimization proposed in this PR would cause us to hash partition rows according to `c2`: the records where `c2 = 2` will wind up in a single reducer / partition, while the other (non-join-matching) records would be spread among reducers according to the hash partitioning. Prior to this patch, our [only choice](https://github.com/apache/spark/blame/0bd7a3dfab41336dba2788a3d1fa3cf5b9f410d3/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L330) was to plan this query as a broadcast nested loop join (BNLJ): if we exceeded the broadcast size limit then this query would fail. As a result, this PR's changes end up raising the limit on the size of data we can query. However, I think that this change might slightly regress performance in cases where one side of the join is very small: Spark currently [doesn't support broadcast hash join for full outer joins](https://github.com/apache/spark/blame/0bd7a3dfab41336dba2788a3d1fa3cf5b9f410d3/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L113), so queries which previously _could_ fit as broadcast nested loop joins would instead become sort-merge joins. ### Sidebar: a multi-pass approach: I _think_ that it might be possible to rewrite ```sql SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 ``` as ```sql (SELECT * from t1 CROSS JOIN t2 where t1.c2 = 2 AND t2.c2 = 2) UNION ALL (SELECT t1.*, <nulls> from t1 where not (t1.c2 <=> 2)) UNION ALL (SELECT <nulls>, t2.* from t2 where not (t2.c2 <=> 2)) ``` (note the use of null-safe equals) That has the advantage of avoiding a shuffle for the non-matching rows at the cost of needing to scan each join input twice (since we don't have a great way to emit multiple output streams from a single task). This could also be helpful in case `c2` has skew for values other than `2` (since it avoids hash partitioning on a skew column). _Automatically_ picking that plan is hard without really good cost-based optimization, though. ## For left joins: AFAIK the potential drawbacks for full outer joins (loss of broadcast join in cases where data is really tiny) don't apply to left joins (since we'd still be able to plan broadcast hash joins), so it seems like this effectively raises the scale limit by giving us an alternative to broadcast nested loop join when the data is very large. ## For other join types: I haven't considered any other join types. ## Summary: For joins where columns of different tables are related via being equal to the same constant value, it looks like this PR's changes give us an alternative to BNLJ in situations where the data is very large. @peter-toth, do you have a motivating use-case / more realistic example of where this query pattern occurs? My initial feeling is that this seems like a pretty niche optimization and it's not clear whether this occurs often enough in real queries to warrant the added complexity and potential corner-cases.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
