abstractdog commented on a change in pull request #152:
URL: https://github.com/apache/tez/pull/152#discussion_r734443864



##########
File path: tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java
##########
@@ -1793,80 +1793,107 @@ public TaskAttemptStateInternal transition(TaskAttemptImpl attempt, TaskAttemptE
  MultipleArcTransition<TaskAttemptImpl, TaskAttemptEvent, TaskAttemptStateInternal> {
 
     @Override
-    public TaskAttemptStateInternal transition(TaskAttemptImpl attempt,
+    public TaskAttemptStateInternal transition(TaskAttemptImpl sourceAttempt,
         TaskAttemptEvent event) {
       TaskAttemptEventOutputFailed outputFailedEvent = 
           (TaskAttemptEventOutputFailed) event;
-      TezEvent tezEvent = outputFailedEvent.getInputFailedEvent();
-      TezTaskAttemptID failedDestTaId = tezEvent.getSourceInfo().getTaskAttemptID();
-      InputReadErrorEvent readErrorEvent = (InputReadErrorEvent)tezEvent.getEvent();
+      TezEvent inputFailedEvent = outputFailedEvent.getInputFailedEvent();
+      TezTaskAttemptID failedDestTaId = inputFailedEvent.getSourceInfo().getTaskAttemptID();
+
+      InputReadErrorEvent readErrorEvent = (InputReadErrorEvent)inputFailedEvent.getEvent();
       int failedInputIndexOnDestTa = readErrorEvent.getIndex();
-      if (readErrorEvent.getVersion() != attempt.getID().getId()) {
-        throw new TezUncheckedException(attempt.getID()
+
+      if (readErrorEvent.getVersion() != sourceAttempt.getID().getId()) {
+        throw new TezUncheckedException(sourceAttempt.getID()
             + " incorrectly blamed for read error from " + failedDestTaId
             + " at inputIndex " + failedInputIndexOnDestTa + " version"
             + readErrorEvent.getVersion());
       }
-      LOG.info(attempt.getID()
-            + " blamed for read error from " + failedDestTaId
-            + " at inputIndex " + failedInputIndexOnDestTa);
-      long time = attempt.clock.getTime();
-      Long firstErrReportTime = attempt.uniquefailedOutputReports.get(failedDestTaId);
+      // source host: where the data input is supposed to come from
+      String sHost = sourceAttempt.getNodeId().getHost();
+      // destination host: where the data is being fetched to
+      String dHost = readErrorEvent.getDestinationLocalhostName();
+
+      LOG.info("{} (on {}) blamed for read error from {} (on {}) at inputIndex {}", sourceAttempt.getID(),
+          sHost, failedDestTaId, dHost, failedInputIndexOnDestTa);
+
+      boolean tooManyDownstreamHostsBlamedTheSameUpstreamHost = false;
+      Map<String, Set<String>> downstreamBlamingHosts = sourceAttempt.getVertex().getDownstreamBlamingHosts();

Review comment:
   @rbalamohan: I uploaded a new patch here: https://github.com/apache/tez/pull/152/commits/c523ebba8f14c1256cb992109c18ce023fde575c
   
   Tested on a cluster: I found that instead of the total number of nodes, we need to take only the active hosts into account.
   It was an interesting situation: the AM was running while I restarted the Hive LLAP application, and since node removal is not implemented in AMNodeTracker, the old nodes stick around, so it turned out we need to consider only ACTIVE nodes. Here is what I saw when I logged the nodes:
   
   ```
   2021-10-21 05:34:53,798 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: nodes: {
     ccycloud-4.hive-runtime-perf.root.hwx.site:40783:0deaa785-cd96-45ad-ae1c-e5f9e44be73a={AMNodeImpl: nodeId: ccycloud-4.hive-runtime-perf.root.hwx.site:40783:0deaa785-cd96-45ad-ae1c-e5f9e44be73a, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false},
     ccycloud-8.hive-runtime-perf.root.hwx.site:39959:fce0bc31-a1a4-49ea-b40c-53ea84a783f6={AMNodeImpl: nodeId: ccycloud-8.hive-runtime-perf.root.hwx.site:39959:fce0bc31-a1a4-49ea-b40c-53ea84a783f6, state: ACTIVE, containers: 221, completed containers: 179, healthy: true, blackListed: false},
     ccycloud-7.hive-runtime-perf.root.hwx.site:38523:1a2b75f4-1d87-4735-bea4-d96790ee5420={AMNodeImpl: nodeId: ccycloud-7.hive-runtime-perf.root.hwx.site:38523:1a2b75f4-1d87-4735-bea4-d96790ee5420, state: ACTIVE, containers: 223, completed containers: 181, healthy: true, blackListed: false},
     ccycloud-5.hive-runtime-perf.root.hwx.site:35671:ea352491-39bb-4d81-b9c5-e519b87696b8={AMNodeImpl: nodeId: ccycloud-5.hive-runtime-perf.root.hwx.site:35671:ea352491-39bb-4d81-b9c5-e519b87696b8, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false},
     ccycloud-4.hive-runtime-perf.root.hwx.site:46577:d8f0caff-790e-453a-96ac-14964d660f7e={AMNodeImpl: nodeId: ccycloud-4.hive-runtime-perf.root.hwx.site:46577:d8f0caff-790e-453a-96ac-14964d660f7e, state: ACTIVE, containers: 236, completed containers: 194, healthy: true, blackListed: false},
     ccycloud-3.hive-runtime-perf.root.hwx.site:34353:69d5de51-df04-481b-9aad-a6b88a2a33c9={AMNodeImpl: nodeId: ccycloud-3.hive-runtime-perf.root.hwx.site:34353:69d5de51-df04-481b-9aad-a6b88a2a33c9, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false},
     ccycloud-6.hive-runtime-perf.root.hwx.site:44728:187f814d-9760-4965-881f-9dac9f4a2ae5={AMNodeImpl: nodeId: ccycloud-6.hive-runtime-perf.root.hwx.site:44728:187f814d-9760-4965-881f-9dac9f4a2ae5, state: UNHEALTHY, containers: 0, completed containers: 0, healthy: false, blackListed: false},
     ccycloud-9.hive-runtime-perf.root.hwx.site:42123:dcb2622b-fed3-4f88-a4e2-70d0ebc57a5e={AMNodeImpl: nodeId: ccycloud-9.hive-runtime-perf.root.hwx.site:42123:dcb2622b-fed3-4f88-a4e2-70d0ebc57a5e, state: ACTIVE, containers: 230, completed containers: 188, healthy: true, blackListed: false}}
   ```
   
   ```
   2021-10-21 05:34:53,798 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: currentNumberOfFailingDownstreamHosts: 1, numNodes: 4, fraction: 0.25, max allowed: 0.2
   ```
   
   This log message was emitted when numNodes: 4 was already counting only state: ACTIVE nodes; the temporary "nodes:" log message above shows the 4 running LLAP daemons plus the 4 old ones, which appear as UNHEALTHY instead of ACTIVE. With 1 failing downstream host out of 4 active nodes, the fraction is 0.25, which exceeds the allowed maximum of 0.2.
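   
   To illustrate the idea, a minimal self-contained sketch of the active-host based check is below; the class, method, field and config names are made up for the example and are not the exact ones from the patch:
   
   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;
   
   // Standalone sketch of the "blame the source attempt only when enough distinct
   // downstream hosts complain, relative to the number of ACTIVE nodes" idea.
   // All names here are illustrative; they are not the exact ones from the patch.
   public class DownstreamBlameSketch {
   
     enum NodeState { ACTIVE, UNHEALTHY }
   
     // destination hosts that reported read errors, keyed by the blamed source host
     private final Map<String, Set<String>> downstreamBlamingHosts = new HashMap<>();
   
     // analogous to a "max allowed" fraction coming from configuration, e.g. 0.2
     private final double maxAllowedFraction;
   
     DownstreamBlameSketch(double maxAllowedFraction) {
       this.maxAllowedFraction = maxAllowedFraction;
     }
   
     // Records that destination host dHost blamed source host sHost and decides
     // whether too many downstream hosts blamed the same upstream host. Only
     // ACTIVE nodes are counted, because stale nodes (e.g. after an LLAP restart)
     // linger as UNHEALTHY instead of being removed from AMNodeTracker.
     boolean tooManyDownstreamHostsBlamed(String sHost, String dHost, List<NodeState> allNodes) {
       Set<String> blamingHosts = downstreamBlamingHosts.computeIfAbsent(sHost, k -> new HashSet<>());
       blamingHosts.add(dHost);
   
       long numActiveNodes = allNodes.stream().filter(s -> s == NodeState.ACTIVE).count();
       if (numActiveNodes == 0) {
         return false;
       }
       double fraction = (double) blamingHosts.size() / numActiveNodes;
       // e.g. 1 blaming host / 4 active nodes = 0.25 > 0.2 -> blame the source attempt
       return fraction > maxAllowedFraction;
     }
   
     public static void main(String[] args) {
       DownstreamBlameSketch sketch = new DownstreamBlameSketch(0.2);
       // 4 active + 4 stale nodes, as in the "nodes:" log above
       List<NodeState> nodes = Arrays.asList(
           NodeState.ACTIVE, NodeState.ACTIVE, NodeState.ACTIVE, NodeState.ACTIVE,
           NodeState.UNHEALTHY, NodeState.UNHEALTHY, NodeState.UNHEALTHY, NodeState.UNHEALTHY);
       // one downstream host blames the source host -> 1/4 = 0.25 > 0.2 -> true
       System.out.println(sketch.tooManyDownstreamHostsBlamed("src-host", "dst-host-1", nodes));
     }
   }
   ```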



