Ngone51 commented on a change in pull request #28257:
URL: https://github.com/apache/spark/pull/28257#discussion_r414231747
##########
File path: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
##########
@@ -675,11 +676,15 @@ private[spark] class TaskSchedulerImpl(
// Check whether the barrier tasks are partially launched.
// TODO SPARK-24818 handle the assert failure case (that can happen
when some locality
// requirements are not fulfilled, and we should revert the launched
tasks).
- require(addressesWithDescs.size == taskSet.numTasks,
- s"Skip current round of resource offers for barrier stage
${taskSet.stageId} " +
Review comment:
> Could we instead have a counter inside the taskSet or other mechanism to allow for X retries?

I believe barrier retry is the next step, planned for a future release, but not for 2.4.
> It seems like turning it off is a bit of a behaviour change from the point of view of considering backporting.

What's the behavior change? Previously, the application would hang; now it fails, as we expected in the first place.
> require would have ended up throwing an exception in this case - we should do the same after taskSet.abort to prevent change in behavior - particularly for backport

To be honest, I'm fine with keeping the exception there, but I disagree that throwing an exception is expected behavior that we cannot change. In practice, no one handles the exception thrown here, and I believe the expected behavior is to fail the application with a clear error message.
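
To make the point concrete, here is a minimal sketch (not the exact patch) of the abort-instead-of-require approach being discussed. The names `addressesWithDescs`, `taskSet`, `stageId`, and `numTasks` come from the diff above; the message wording and the optional re-throw are illustrative assumptions:

```scala
// Sketch only: replace the hard require() with an explicit check that aborts
// the barrier task set, so the application fails with a clear error message
// instead of hanging or surfacing an unhandled IllegalArgumentException.
if (addressesWithDescs.size != taskSet.numTasks) {
  val errorMsg = s"Fail resource offers for barrier stage ${taskSet.stageId} " +
    s"because only ${addressesWithDescs.size} out of a total number of " +
    s"${taskSet.numTasks} tasks got resource offers."
  // Abort the task set with the clear message, failing the application.
  taskSet.abort(errorMsg)
  // Optionally re-throw to preserve the old require() behavior for backports;
  // whether to keep this is the open question in this thread.
  throw new SparkException(errorMsg)
}
```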