[GitHub] [spark] ulysses-you commented on a diff in pull request #33522: [SPARK-36290][SQL] Pull out join condition

GitBox Wed, 15 Jun 2022 18:20:49 -0700


ulysses-you commented on code in PR #33522:
URL: https://github.com/apache/spark/pull/33522#discussion_r898610674



##########
sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala:
##########
@@ -1057,7 +1057,7 @@ class JoinSuite extends QueryTest with SharedSparkSession with AdaptiveSparkPlan
     val pythonEvals = collect(joinNode.get) {
       case p: BatchEvalPythonExec => p
     }
-    assert(pythonEvals.size == 2)
+    assert(pythonEvals.size == 4)

Review Comment:
   > Increase complex join key runs from 1 to 2 for BHJ.
   
   We can check whether the pulled-out side can be broadcast, so this should not be a blocker?
   
   > It may increase the data size of shuffle. For example: the join key is: concat(col1, col2, col3, col4 ...).
   
   This is really a trade-off. One conservative option may be: we only pull out the complex keys whose inner attributes are not part of the final output, so we avoid the extra shuffle data as far as possible. For example:
   ```sql
   SELECT a FROM t1 JOIN t2 on t1.a = t2.x + 1;
   ```
   
   A config should also be introduced so this can be enabled or disabled easily.
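   
   To make the conservative option concrete, here is a rough sketch of the decision rule only. It is a toy model, not Catalyst code: `Expr`, `Attr`, `Lit`, `Add`, `PullOutRule` and the `enabled` flag (standing in for the proposed config) are all hypothetical names, and the real rule would work on Spark's own expression and plan classes.
   
   ```scala
   // Toy expression model (hypothetical, not Spark's Catalyst classes).
   sealed trait Expr { def references: Set[String] }
   final case class Attr(name: String) extends Expr {
     def references: Set[String] = Set(name)
   }
   final case class Lit(value: Int) extends Expr {
     def references: Set[String] = Set.empty
   }
   final case class Add(left: Expr, right: Expr) extends Expr {
     def references: Set[String] = left.references ++ right.references
   }

   object PullOutRule {
     // A join key is "complex" when it is anything other than a bare attribute.
     def isComplex(key: Expr): Boolean = key match {
       case _: Attr => false
       case _       => true
     }

     // Conservative rule: pull out a complex key only when none of the
     // attributes it reads appear in the final output, and only when the
     // (hypothetical) config flag is on.
     def shouldPullOut(key: Expr, finalOutput: Set[String], enabled: Boolean): Boolean =
       enabled && isComplex(key) && key.references.intersect(finalOutput).isEmpty
   }
   ```
   
   For the query above, the key `t2.x + 1` reads only `t2.x` and the final output is just `t1.a`, so `PullOutRule.shouldPullOut(Add(Attr("t2.x"), Lit(1)), Set("t1.a"), enabled = true)` is `true`. If the query also selected `t2.x`, the same call with `Set("t1.a", "t2.x")` would be `false`, and the key would stay inline.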
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

