zhongyu09 commented on a change in pull request #30998:
URL: https://github.com/apache/spark/pull/30998#discussion_r553794663



##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
       }
     }
   }
+
+  test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {

Review comment:
       I am afraid that won't work. The real materialize work runs in different 
thread for broadcast and map tasks. So the job sequence of start events might 
not be strictly the same as calling materialize().

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
       }
     }
   }
+
+  test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {

Review comment:
       @cloud-fan  You mean the error 1751 was not greater than 2000 
(AdaptiveQueryExecSuite.scala:1454) ? That's my fault, by moving the UT from 
BroadcastJoinSuite to AdaptiveQueryExecSuite, the spark conf changed to 
local[2] and so the running times are faster than before.

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
       }
     }
   }
+
+  test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {

Review comment:
       Yes I know, the failure reported by @LuciferYang is easy to solve. 
   But the question is, the jobs submission order may be not correct, like the 
test in https://github.com/apache/spark/pull/31084.

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
       }
     }
   }
+
+  test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {

Review comment:
       @cloud-fan  You mean `with retry it should be unlikely to fail` to solve 
the edge case? 

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
       }
     }
   }
+
+  test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {

Review comment:
       I am trying to create a new PR. Sorry for the inconvenience.

##########
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
       }
     }
   }
+
+  test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {

Review comment:
       We can get the stage submission time using SparkListeneer. But after 
trying serval times, the stage submission time is not stable thus the UT cannot 
always passed. I suggest to remove the UT before we completely solve the issue. 
https://github.com/apache/spark/pull/31099




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to