zhongyu09 commented on a change in pull request #30998:
URL: https://github.com/apache/spark/pull/30998#discussion_r553794663
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
}
}
}
+
+ test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {
Review comment:
I am afraid that won't work. The real materialize work runs in different
thread for broadcast and map tasks. So the job sequence of start events might
not be strictly the same as calling materialize().
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
}
}
}
+
+ test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {
Review comment:
@cloud-fan You mean the error 1751 was not greater than 2000
(AdaptiveQueryExecSuite.scala:1454) ? That's my fault, by moving the UT from
BroadcastJoinSuite to AdaptiveQueryExecSuite, the spark conf changed to
local[2] and so the running times are faster than before.
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
}
}
}
+
+ test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {
Review comment:
Yes I know, the failure reported by @LuciferYang is easy to solve.
But the question is, the jobs submission order may be not correct, like the
test in https://github.com/apache/spark/pull/31084.
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
}
}
}
+
+ test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {
Review comment:
@cloud-fan You mean `with retry it should be unlikely to fail` to solve
the edge case?
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
}
}
}
+
+ test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {
Review comment:
I am trying to create a new PR. Sorry for the inconvenience.
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
##########
@@ -1431,4 +1431,28 @@ class AdaptiveQueryExecSuite
}
}
}
+
+ test("SPARK-33933: AQE broadcast should not timeout with slow map tasks") {
Review comment:
We can get the stage submission time using SparkListeneer. But after
trying serval times, the stage submission time is not stable thus the UT cannot
always passed. I suggest to remove the UT before we completely solve the issue.
https://github.com/apache/spark/pull/31099
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]