Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/21068#discussion_r185539768
--- Diff:
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
---
@@ -602,8 +569,7 @@ private[yarn] class YarnAllocator(
completedContainer.getDiagnostics,
PMEM_EXCEEDED_PATTERN))
case _ =>
- // Enqueue the timestamp of failed executor
- failedExecutorsTimeStamps.enqueue(clock.getTimeMillis())
+
allocatorBlacklistTracker.handleResourceAllocationFailure(hostOpt)
--- End diff --
are we actually totally sure this is an allocation failure? It's really any
failure which isn't covered by the cases above, right? For example, it could
be somebody logging in to the box and running `kill -9` on the executor (people
do weird things), or a sudden hardware failure, etc. I think the handling is
still fine, but I'd add a comment here explaining a bit.
If we *really* want to separate out allocation failures, the handling would
need to be a little more complex: we'd have to keep track of executors that
have registered with the driver.
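Something like the following sketch, purely to illustrate the idea — the class and method names here (`AllocationFailureTracker`, `onExecutorRegistered`, `isAllocationFailure`) are hypothetical and not part of the actual PR or of `YarnAllocator`:

```scala
import scala.collection.mutable

// Hypothetical sketch of the "more complex" handling: only treat a
// container exit as an allocation failure if its executor never
// managed to register with the driver.
class AllocationFailureTracker {
  // IDs of executors that have successfully registered with the driver.
  private val registeredExecutors = mutable.Set[String]()

  // Called when the driver sees the executor register.
  def onExecutorRegistered(execId: String): Unit =
    registeredExecutors += execId

  // A container that died before its executor ever registered is a
  // plausible allocation failure; anything after registration is some
  // other kind of failure (kill -9, hardware, etc.).
  def isAllocationFailure(execId: String): Boolean =
    !registeredExecutors.contains(execId)
}
```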
@tgravescs sound ok?
---