[ 
https://issues.apache.org/jira/browse/TEZ-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039576#comment-15039576
 ] 

Jeff Zhang commented on TEZ-2968:
---------------------------------

Comments:

* Duplicate log message (TaskImpl.java): the exception is logged in the
catch block and again in internalErrorUncaughtException.
{code}
+      } catch (RuntimeException e) {
+        LOG.error("Uncaught exception when trying handle event " + event.getType()
+            + " at current state " + oldState + " for task " + this.taskId, e);
+        internalErrorUncaughtException(event.getType(), e);
       }

+  protected void internalErrorUncaughtException(TaskEventType type, Exception e) {
+    LOG.error("Uncaught exception when handling event " + type + " on Task "
+        + this.taskId + " in state:"
+        + getInternalState(), e);
{code}

* Exceeding the counter limit in Task is not handled. Most of a task's
counters come from its task attempt, but the task also has its own
counters, so the task attempt can stay under the counter limit while the
task's combined counters reach it.
{code}
  @Override
  public TezCounters getCounters() {
    TezCounters counters = new TezCounters();
    counters.incrAllCounters(this.counters);
    readLock.lock();
    try {
      TaskAttempt bestAttempt = selectBestAttempt();
      if (bestAttempt != null) {
        counters.incrAllCounters(bestAttempt.getCounters());
      }
      return counters;
    } finally {
      readLock.unlock();
    }
  }
{code}
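
To illustrate the point above, here is a minimal sketch of how merging two counter sets that are each under the limit can still exceed it. LimitedCounters, MAX_COUNTERS and the use of IllegalStateException are hypothetical stand-ins chosen for this example; they are not the Tez classes or the real limit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a counter set with a global limit; not the Tez API.
class LimitedCounters {
    static final int MAX_COUNTERS = 4; // assumed limit, for illustration only

    private final Map<String, Long> values = new HashMap<>();

    // Adds delta to a counter, refusing to create a new counter beyond the limit.
    void increment(String name, long delta) {
        if (!values.containsKey(name) && values.size() >= MAX_COUNTERS) {
            throw new IllegalStateException(
                "Too many counters: " + (values.size() + 1) + " max=" + MAX_COUNTERS);
        }
        values.merge(name, delta, Long::sum);
    }

    // Merges every counter from other into this set.
    void incrementAll(LimitedCounters other) {
        for (Map.Entry<String, Long> e : other.values.entrySet()) {
            increment(e.getKey(), e.getValue());
        }
    }

    int size() {
        return values.size();
    }
}

class CounterMergeSketch {
    // Returns true if merging the task's own counters with the attempt's
    // counters exceeds the limit, although each set alone is under it.
    static boolean mergeExceedsLimit() {
        LimitedCounters taskOwn = new LimitedCounters();
        taskOwn.increment("TASK_A", 1);
        taskOwn.increment("TASK_B", 1);

        LimitedCounters attempt = new LimitedCounters();
        attempt.increment("ATTEMPT_A", 1);
        attempt.increment("ATTEMPT_B", 1);
        attempt.increment("ATTEMPT_C", 1);

        LimitedCounters merged = new LimitedCounters();
        merged.incrementAll(taskOwn);
        try {
            merged.incrementAll(attempt); // the fifth distinct counter is rejected
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("merge exceeds limit: " + mergeExceedsLimit());
    }
}
```

In this sketch the task has 2 counters and the attempt has 3, both under the assumed limit of 4, yet the merged set needs 5.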

* The same applies to TaskAttempt. It is still possible to get a counter
limit exception there, although most of its counters come from heartbeats.
{code}
  @Override
  public TezCounters getCounters() {
    readLock.lock();
    try {
      reportedStatus.setLocalityCounter(this.localityCounter);
      TezCounters counters = reportedStatus.counters;
      if (counters == null) {
        counters = EMPTY_COUNTERS;
      }
      return counters;
    } finally {
      readLock.unlock();
    }
  }
{code}
* Should we add an option to allow the DAG to succeed even when the
counter limit is exceeded? (By default it could be false.)

* No unit tests? A system test might also be helpful.

* getCounters will also be used in RPC calls (getDAGStatus/getVertexStatus)
and in the Tez UI (AMWebController). I think for RPC calls the counter limit
exception will be propagated to the client, but I am not sure whether it
will affect tez-ui.
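
As a rough sketch of the option suggested in the comments above, the completion path could catch the limit exception itself and map it to a final state, so that it never reaches the dispatcher. Everything here is hypothetical: the class, the flag name and its default, and the use of IllegalStateException as a stand-in for the counter limit exception are assumptions for illustration, not existing Tez configuration or code.

```java
import java.util.function.Supplier;

class CounterLimitPolicySketch {
    enum FinalState { SUCCEEDED, FAILED }

    // Hypothetical option: may a DAG succeed after the counter limit is exceeded?
    // The name and the default of false are assumptions taken from the proposal.
    static final boolean DEFAULT_SUCCEED_ON_LIMIT_EXCEEDED = false;

    // Reads the counters on completion and decides the final state.
    // The exception is handled here instead of propagating to the caller.
    static FinalState finish(Supplier<Integer> counterLookup,
                             boolean succeedOnLimitExceeded) {
        try {
            counterLookup.get();
            return FinalState.SUCCEEDED;
        } catch (IllegalStateException limitExceeded) {
            // Stand-in for the counter limit exception.
            return succeedOnLimitExceeded ? FinalState.SUCCEEDED : FinalState.FAILED;
        }
    }

    public static void main(String[] args) {
        Supplier<Integer> overLimit = () -> {
            throw new IllegalStateException("Too many counters");
        };
        System.out.println(finish(overLimit, DEFAULT_SUCCEED_ON_LIMIT_EXCEEDED));
        System.out.println(finish(overLimit, true));
    }
}
```

With the default, a lookup that exceeds the limit ends in FAILED rather than an uncaught exception; with the option enabled it ends in SUCCEEDED.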

> Counter limits exception causes AM to crash 
> --------------------------------------------
>
>                 Key: TEZ-2968
>                 URL: https://issues.apache.org/jira/browse/TEZ-2968
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Hitesh Shah
>            Priority: Critical
>         Attachments: TEZ-2968.1.wip.patch
>
>
> On vertex or dag completion, the counter limits exception propagates to the 
> Dispatcher and causes the AM to die. 


