[jira] [Comment Edited] (MAPREDUCE-7253) MR job got hanged forever when map task timeout

shilongfei (Jira) Sun, 12 Jan 2020 19:35:23 -0800


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013996#comment-17013996
 ]


shilongfei edited comment on MAPREDUCE-7253 at 1/13/20 3:34 AM:
----------------------------------------------------------------

The log of this failed map is as follows. The first attempt is scheduled at 
9:46, the resource is allocated at 9:50, the second attempt is scheduled at 
9:52, and no subsequent resources are allocated. 
{code:java}
2020-01-02 09:46:14,181 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED
2020-01-02 09:50:11,102 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from 
UNASSIGNED to ASSIGNED
2020-01-02 09:50:11,146 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from ASSIGNED 
to RUNNING
2020-01-02 09:52:11,944 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_1 TaskAttempt Transitioned from NEW to 
UNASSIGNED
{code}
 

>From the AM allocation container log, it can be seen that the first attempt to 
>the current map is assigned to the container at 9:50, and the second attempt 
>has not been
{code:java}
2020-01-02 09:46:16,377 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_e922_1577272892024_639901_01_000002 to 
attempt_1577272892024_639901_m_000250_0
2020-01-02 09:46:16,379 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_e922_1577272892024_639901_01_000003 to 
attempt_1577272892024_639901_m_000839_0
......
2020-01-02 09:50:11,095 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_e922_1577272892024_639901_01_016146 to 
attempt_1577272892024_639901_m_007585_0
......
{code}
 

Looking at the container application log, it can be found that after 9:52 AM 
has applied for resources, but there is no second attempt allocated to the map.
{code:java}
2020-01-02 09:52:30,084 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
2020-01-02 09:52:31,093 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 3
2020-01-02 09:52:33,110 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
2020-01-02 09:52:34,119 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
2020-01-02 09:55:11,234 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
{code}
 

The requested resources are allocated to reduce
{code:java}
2020-01-02 09:52:31,093 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:52:33,110 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:52:34,119 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:55:11,234 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to 
reduce{code}
 

Because the timeout event does not carry the rescheduler flag
{code:java}
if(taskTimedOut) {
  // task is lost, remove from the list and raise lost event
  iterator.remove();
  eventHandler.handle(new TaskAttemptDiagnosticsUpdateEvent(entry
      .getKey(), "AttemptID:" + entry.getKey().toString()
      + " Timed out after " + taskTimeOut / 1000 + " secs"));
  eventHandler.handle(new TaskAttemptEvent(entry.getKey(),
      TaskAttemptEventType.TA_TIMED_OUT));
}{code}


was (Author: shilongfei):
The log of this failed map is as follows. The first attempt is scheduled at 
9:46, the resource is allocated at 9:50, the second attempt is scheduled at 
9:52, and no subsequent resources are allocated.

 
{code:java}
2020-01-02 09:46:14,181 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from NEW to 
UNASSIGNED
2020-01-02 09:50:11,102 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from 
UNASSIGNED to ASSIGNED
2020-01-02 09:50:11,146 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_0 TaskAttempt Transitioned from ASSIGNED 
to RUNNING
2020-01-02 09:52:11,944 INFO [AsyncDispatcher event handler] 
org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: 
attempt_1577272892024_639901_m_007585_1 TaskAttempt Transitioned from NEW to 
UNASSIGNED
{code}
 

>From the AM allocation container log, it can be seen that the first attempt to 
>the current map is assigned to the container at 9:50, and the second attempt 
>has not been
{code:java}
2020-01-02 09:46:16,377 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_e922_1577272892024_639901_01_000002 to 
attempt_1577272892024_639901_m_000250_0
2020-01-02 09:46:16,379 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_e922_1577272892024_639901_01_000003 to 
attempt_1577272892024_639901_m_000839_0
......
2020-01-02 09:50:11,095 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned container 
container_e922_1577272892024_639901_01_016146 to 
attempt_1577272892024_639901_m_007585_0
......
{code}
Looking at the container application log, it can be found that after 9:52 AM 
has applied for resources, but there is no second attempt allocated to the map.
{code:java}
2020-01-02 09:52:30,084 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
2020-01-02 09:52:31,093 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 3
2020-01-02 09:52:33,110 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
2020-01-02 09:52:34,119 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
2020-01-02 09:55:11,234 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Got allocated 
containers 1
{code}
The requested resources are allocated to reduce
{code:java}
2020-01-02 09:52:31,093 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:52:33,110 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:52:34,119 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to reduce
2020-01-02 09:55:11,234 INFO [RMCommunicator Allocator] 
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Assigned to 
reduce{code}
 

Because the timeout event does not carry the rescheduler flag
{code:java}
if(taskTimedOut) {
  // task is lost, remove from the list and raise lost event
  iterator.remove();
  eventHandler.handle(new TaskAttemptDiagnosticsUpdateEvent(entry
      .getKey(), "AttemptID:" + entry.getKey().toString()
      + " Timed out after " + taskTimeOut / 1000 + " secs"));
  eventHandler.handle(new TaskAttemptEvent(entry.getKey(),
      TaskAttemptEventType.TA_TIMED_OUT));
}{code}

> MR job got hanged forever when map task timeout
> -----------------------------------------------
>
>                 Key: MAPREDUCE-7253
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7253
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: shilongfei
>            Priority: Major
>         Attachments: 2020-01-02 11-15-42屏幕截图.png, 2020-01-02 
> 11-15-58屏幕截图.png, 2020-01-02 11-16-14屏幕截图.png
>
>
> When the MR map job fails and the reduce job has started to be scheduled, the 
> failed map will not be scheduled, it will always be in the NEW state, and the 
> reduce job will also be blocked.
> So why was the map job not scheduled in time?  !2020-01-02 11-15-58屏幕截图.png!
> !2020-01-02 11-15-42屏幕截图.png!
> !2020-01-02 11-16-14屏幕截图.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (MAPREDUCE-7253) MR job got hanged forever when map task timeout

Reply via email to