abstractdog commented on a change in pull request #152:
URL: https://github.com/apache/tez/pull/152#discussion_r732484724



##########
File path: tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java
##########
@@ -1793,80 +1793,107 @@ public TaskAttemptStateInternal transition(TaskAttemptImpl attempt, TaskAttemptE
   MultipleArcTransition<TaskAttemptImpl, TaskAttemptEvent, TaskAttemptStateInternal> {
 
     @Override
-    public TaskAttemptStateInternal transition(TaskAttemptImpl attempt,
+    public TaskAttemptStateInternal transition(TaskAttemptImpl sourceAttempt,
        TaskAttemptEvent event) {
      TaskAttemptEventOutputFailed outputFailedEvent =
          (TaskAttemptEventOutputFailed) event;
-      TezEvent tezEvent = outputFailedEvent.getInputFailedEvent();
-      TezTaskAttemptID failedDestTaId = tezEvent.getSourceInfo().getTaskAttemptID();
-      InputReadErrorEvent readErrorEvent = (InputReadErrorEvent)tezEvent.getEvent();
+      TezEvent inputFailedEvent = outputFailedEvent.getInputFailedEvent();
+      TezTaskAttemptID failedDestTaId = inputFailedEvent.getSourceInfo().getTaskAttemptID();
+
+      InputReadErrorEvent readErrorEvent = (InputReadErrorEvent)inputFailedEvent.getEvent();
       int failedInputIndexOnDestTa = readErrorEvent.getIndex();
-      if (readErrorEvent.getVersion() != attempt.getID().getId()) {
-        throw new TezUncheckedException(attempt.getID()
+
+      if (readErrorEvent.getVersion() != sourceAttempt.getID().getId()) {
+        throw new TezUncheckedException(sourceAttempt.getID()
            + " incorrectly blamed for read error from " + failedDestTaId
            + " at inputIndex " + failedInputIndexOnDestTa + " version"
            + readErrorEvent.getVersion());
       }
-      LOG.info(attempt.getID()
-            + " blamed for read error from " + failedDestTaId
-            + " at inputIndex " + failedInputIndexOnDestTa);
-      long time = attempt.clock.getTime();
-      Long firstErrReportTime = attempt.uniquefailedOutputReports.get(failedDestTaId);
+      // source host: where the data input is supposed to come from
+      String sHost = sourceAttempt.getNodeId().getHost();
+      // destination: where the data is tried to be fetched to
+      String dHost = readErrorEvent.getDestinationLocalhostName();
+
+      LOG.info("{} (on {}) blamed for read error from {} (on {}) at inputIndex {}", sourceAttempt.getID(),
+          sHost, failedDestTaId, dHost, failedInputIndexOnDestTa);
+
+      boolean tooManyDownstreamHostsBlamedTheSameUpstreamHost = false;
+      Map<String, Set<String>> downstreamBlamingHosts = sourceAttempt.getVertex().getDownstreamBlamingHosts();

Review comment:
       @rbalamohan: please note that this map is stored at the vertex level...
   ```
   sourceAttempt.getVertex().getDownstreamBlamingHosts()
   ```
   ...not at the task attempt level, so the memory it occupies is proportional to the number of vertices. Do you think there is still a memory pressure problem?
   
   btw. the idea of storing this at the vertex level was that if we have already detected a LOST_OUTPUT for source_attempt_0 on source_host_0, then we immediately mark source_attempt_x on source_host_0 as OUTPUT_LOST when an input_read_error comes in blaming source_attempt_x. Please let me know if this makes sense; a rough sketch of the bookkeeping follows below.
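   
   For reference, a simplified standalone sketch of how that vertex-level bookkeeping can work (the class, method and threshold names below are made up for illustration only; the real logic lives in TaskAttemptImpl/Vertex in this PR):
   
   ```java
   // Sketch only: remember, per vertex, which downstream hosts blamed each upstream
   // (source) host, and fail further blamed attempts on that host immediately once
   // enough distinct downstream hosts have reported read errors against it.
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;
   
   public class DownstreamBlamingSketch {
   
     // one map per vertex: upstream (source) host -> downstream hosts that blamed it
     private final Map<String, Set<String>> downstreamBlamingHosts = new HashMap<>();
   
     // illustrative threshold, not the value used in the PR
     private static final int MAX_DOWNSTREAM_HOSTS_BLAMING = 3;
   
     /**
      * Records that destinationHost blamed an attempt running on sourceHost and returns
      * true if the blamed attempt should be marked OUTPUT_LOST right away.
      */
     public boolean shouldDeclareOutputLost(String sourceHost, String destinationHost) {
       Set<String> blamingHosts =
           downstreamBlamingHosts.computeIfAbsent(sourceHost, h -> new HashSet<>());
       blamingHosts.add(destinationHost);
       return blamingHosts.size() >= MAX_DOWNSTREAM_HOSTS_BLAMING;
     }
   
     public static void main(String[] args) {
       DownstreamBlamingSketch sketch = new DownstreamBlamingSketch();
       System.out.println(sketch.shouldDeclareOutputLost("source_host_0", "dest_host_0")); // false
       System.out.println(sketch.shouldDeclareOutputLost("source_host_0", "dest_host_1")); // false
       System.out.println(sketch.shouldDeclareOutputLost("source_host_0", "dest_host_2")); // true
     }
   }
   ```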
   
   
   > To be more specific, is it possible to track downstream hostnames in a set and use that set.size() for computing the fraction to determine if src has to be re-executed or not?
   
   I can try, do you have anything in particular in mind? Does "track downstream hostnames" mean all downstream hostnames that have reported an input read error? I'm also struggling to see how hosts can be turned into a fraction, i.e. what to divide by what... I need to think this over, please let me know if you have an idea. Below is a rough sketch of one possible reading of the suggestion.
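   
   Just to check my understanding, a hypothetical sketch of that reading (every name and the 0.25 fraction below are placeholders; numDownstreamHosts would have to come from the vertex/edge configuration):
   
   ```java
   // Hypothetical sketch: keep only the set of downstream hosts that reported a read
   // error for a given source, and re-execute the source once that set covers a big
   // enough fraction of all downstream hosts.
   import java.util.HashSet;
   import java.util.Set;
   
   public class FractionBasedRerunSketch {
   
     private final Set<String> blamingDownstreamHosts = new HashSet<>();
   
     // placeholder: fraction of all downstream hosts that must blame the source
     private static final double RERUN_FRACTION = 0.25;
   
     public boolean shouldRerunSource(String destinationHost, int numDownstreamHosts) {
       blamingDownstreamHosts.add(destinationHost);
       double fraction = (double) blamingDownstreamHosts.size() / numDownstreamHosts;
       return fraction > RERUN_FRACTION;
     }
   
     public static void main(String[] args) {
       FractionBasedRerunSketch sketch = new FractionBasedRerunSketch();
       // with 10 downstream hosts in total, the third distinct blaming host crosses 0.25
       System.out.println(sketch.shouldRerunSource("dest_host_0", 10)); // false (0.1)
       System.out.println(sketch.shouldRerunSource("dest_host_1", 10)); // false (0.2)
       System.out.println(sketch.shouldRerunSource("dest_host_2", 10)); // true  (0.3)
     }
   }
   ```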



