JoshRosen commented on issue #27309: [SPARK-30598][SQL] Detect equijoins better
URL: https://github.com/apache/spark/pull/27309#issuecomment-576983916
 
 
   To double-check / restate my own understanding of the example query:
   
   ## For inner joins:
   
   If we had an inner join of the form
   
   ```sql
   SELECT * FROM t1 JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2
   ```
   
   then that's effectively a cross-join (because we'll push each `c2 = 2` 
filter beneath its corresponding side of the join). There's not a whole lot of 
room to improve our join execution strategy in this case (this PR's changes 
wouldn't help us).
   
   ## For full outer joins:
   
   Given
   
   ```sql
   SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2
   ```
   
   then the optimization proposed in this PR would cause us to hash partition 
rows according to `c2`: the records where `c2 = 2` will wind up in a single 
reducer / partition, while the other (non-join-matching) records would be 
spread among reducers according to the hash partitioning.
   
   Prior to this patch, our [only 
choice](https://github.com/apache/spark/blame/0bd7a3dfab41336dba2788a3d1fa3cf5b9f410d3/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L330)
 was to plan this query as a broadcast nested loop join (BNLJ): if we exceeded 
the broadcast size limit then this query would fail. As a result, this PR's 
changes end up raising the limit on the size of data we can query.
   
   However, I think that this change might slightly regress performance in 
cases where one side of the join is very small: Spark currently [doesn't 
support broadcast hash join for full outer 
joins](https://github.com/apache/spark/blame/0bd7a3dfab41336dba2788a3d1fa3cf5b9f410d3/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L113),
 so queries which previously _could_ fit as broadcast nested loop joins would 
instead become sort-merge joins.
   
   ### Sidebar: a multi-pass approach:
   
   I _think_ that it might be possible to rewrite 
   
   ```sql
   SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2
   ```
   as
   
   ```sql
   (SELECT * from t1 CROSS JOIN t2 where t1.c2 = 2 AND t2.c2 = 2)
   UNION ALL
   (SELECT t1.*, <nulls> from t1 where not (t1.c2 <=> 2))
   UNION ALL
   (SELECT <nulls>, t2.* from t2 where not (t2.c2 <=> 2))
   ```
   
   (note the use of null-safe equals)
   
   That has the advantage of avoiding a shuffle for the non-matching rows at 
the cost of needing to scan each join input twice (since we don't have a great 
way to emit multiple output streams from a single task). This could also be 
helpful in case `c2` has skew for values other than `2` (since it avoids hash 
partitioning on a skew column).
   
   _Automatically_ picking that plan is hard without really good cost-based 
optimization, though. 
   
   ## For left joins:
   
   AFAIK the potential drawbacks for full outer joins (loss of broadcast join 
in cases where data is really tiny) don't apply to left joins (since we'd still 
be able to plan broadcast hash joins), so it seems like this effectively raises 
the scale limit by giving us an alternative to broadcast nested loop join when 
the data is very large.
   
   ## For other join types:
   
   I haven't considered any other join types.
   
   ## Summary:
   
   For joins where columns of different tables are related via being equal to 
the same constant value, it looks like this PR's changes give us an alternative 
to BNLJ in situations where the data is very large. 
   
   @peter-toth, do you have a motivating use-case / more realistic example of 
where this query pattern occurs? My initial feeling is that this seems like a 
pretty niche optimization and it's not clear whether this occurs often enough 
in real queries to warrant the added complexity and potential corner-cases.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to