Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/20640#discussion_r196997885
--- Diff:
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala
---
@@ -648,14 +645,8 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
totalGpusAcquired -= gpus
gpusByTaskId -= taskId
}
- // If it was a failure, mark the slave as failed for blacklisting purposes
if (TaskState.isFailed(state)) {
- slave.taskFailures += 1
-
- if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
- logInfo(s"Blacklisting Mesos slave $slaveId due to too many failures; " +
- "is Spark installed on it?")
- }
+ logError(s"Task $taskId failed on Mesos slave $slaveId.")
--- End diff --
@IgorBerman I'm not entirely sure what you mean.
Yes, *eventually* I think Mesos should be doing something very similar to
what's in that PR. You can't use that immediately, because for now the other PR
is tied to YARN internals. But I don't think it would be too hard to refactor
what's there a little so that most of the logic could be reused.
That said, I think everybody just wants to get this change in and do the
refactoring in a follow-up.
---