jteagles commented on a change in pull request #37: TEZ-4042: Speculative
attempts should avoid running on the same node
URL: https://github.com/apache/tez/pull/37#discussion_r259218826
##########
File path:
tez-dag/src/main/java/org/apache/tez/dag/app/rm/DagAwareYarnTaskScheduler.java
##########
@@ -567,8 +568,9 @@ private void informAppAboutAssignments(List<Assignment> assignments) {
* @param container the container assigned to the task
*/
private void informAppAboutAssignment(TaskRequest request, Container container) {
- if (blacklistedNodes.contains(container.getNodeId())) {
- Object task = request.getTask();
+ Object task = request.getTask();
+ if (blacklistedNodes.contains(container.getNodeId())
+ || task instanceof TaskAttempt && ((TaskAttempt) task).getUnhealthyNodesHistory().contains(container.getNodeId())) {
Review comment:
I think informAppAboutAssignment does avoid scheduling the speculative
attempt on the nodes already running tasks, so that is good. However, it has
the side effect of deallocating free containers on nodes running attempts that
have been speculated. If we knew for certain that the node was slow, we could
treat the node as unhealthy. But choosing a task for speculation is just a
reasonable guess with many false positives.
Instead, would it work if the check were made in tryAssignReuseContainer,
tryAssignNewContainer, and tryAssignTaskToIdleContainer? With the check made
early, we can avoid deallocating those containers.
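A minimal sketch of what such an early check could look like, assuming a simplified model of the scheduler. Every name here (EarlyNodeCheckSketch, FreeContainer, PendingAttempt, nodesToAvoid, tryAssign) is hypothetical and stands in for the real Tez and YARN types; it is not the code in DagAwareYarnTaskScheduler. The point is only that a container on an avoided node is skipped during matching and stays in the free pool, rather than being assigned and then released.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins only; none of these are real Tez or YARN classes.
final class EarlyNodeCheckSketch {

  /** A free container, identified only by the node it lives on. */
  static final class FreeContainer {
    final String nodeId;

    FreeContainer(String nodeId) {
      this.nodeId = nodeId;
    }
  }

  /** A pending task attempt plus the nodes it should not be placed on. */
  static final class PendingAttempt {
    final Set<String> nodesToAvoid = new HashSet<>();
  }

  /**
   * Early check: while matching, skip containers on avoided nodes.
   * Skipped containers stay in the free list for other tasks to use.
   * Returns the assigned container, or null if none is suitable.
   */
  static FreeContainer tryAssign(PendingAttempt attempt, List<FreeContainer> free) {
    for (int i = 0; i < free.size(); i++) {
      FreeContainer candidate = free.get(i);
      if (attempt.nodesToAvoid.contains(candidate.nodeId)) {
        continue; // leave this container free instead of releasing it
      }
      return free.remove(i); // first container on an acceptable node
    }
    return null;
  }
}
```

With this shape, a speculative attempt that avoids node1 takes a container on node2, and the node1 container remains available to any other pending task.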
In the future, I can see passing the nodes to avoid to the AMRMClient when
requesting new containers, so that we never request a container on a node
that a speculative task attempt should avoid. It may be possible to do that
now, but I have not checked.