[
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Zhang updated TEZ-1470:
----------------------------
Description:
TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this case
TaskAttemptFinishedEvent may been recorded 2 times,and will cause failure in
recovery.
{code}
14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService:
Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause:
org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
attempt finished, more completions than starts encountered,
taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
incompleteAttempts=-1
org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
attempt finished, more completions than starts encountered,
taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
incompleteAttempts=-1
at
org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
at
org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
at
org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
at
org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
at
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
at
org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
{code}
was:
When AM is in attempt 2, sometimes TaskFinishedEvent may be recorded multiple
times, although it won't affect the recovery behavior now, track the issue
here.
Here's one example ( task_1408499461970_0035_1_00_000000 is finished 2 times
here, one is in the attempt 1 and the other is in the attempt 2 )
{code}
15:59:00,671 - Thread( main) - (RecoveryParser.java:332) - Parsing DAG recovery
file
hdfs://0.0.0.0:9000/user/jzhang/tez/staging/.tez/application_1408499461970_0035/recovery/2/dag_1408499461970_0035_1.recovery
15:59:00,716 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=DAG_SUBMITTED, event=dagID=dag_1408499461970_0035_1,
submitTime=1408515526009
15:59:00,718 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=DAG_INITIALIZED,
event=dagID=dag_1408499461970_0035_1, initTime=1408515526187
15:59:00,719 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=DAG_STARTED, event=dagID=dag_1408499461970_0035_1,
startTime=1408515526263
15:59:00,722 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_INITIALIZED, event=vertexName=reduce,
vertexId=vertex_1408499461970_0035_1_01, initRequestedTime=1408515526282,
initedTime=1408515526325, numTasks=1, processorName=null,
additionalInputsCount=0
15:59:00,725 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_INITIALIZED, event=vertexName=map,
vertexId=vertex_1408499461970_0035_1_00, initRequestedTime=1408515526265,
initedTime=1408515526331, numTasks=1, processorName=null,
additionalInputsCount=1
15:59:00,736 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_DATA_MOVEMENT_EVENTS_GENERATED,
event=vertexId=vertex_1408499461970_0035_1_00, eventCount=1
15:59:00,738 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_DATA_MOVEMENT_EVENTS_GENERATED,
event=vertexId=vertex_1408499461970_0035_1_00, eventCount=1
15:59:00,739 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_STARTED,
event=vertexId=vertex_1408499461970_0035_1_00,
startRequestedTime=1408515526332, startedTime=1408515526332
15:59:00,742 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_STARTED, event=vertexName=null,
taskId=task_1408499461970_0035_1_00_000000, scheduledTime=1408515526345,
launchTime=1408515526345
15:59:00,742 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_STARTED,
event=vertexId=vertex_1408499461970_0035_1_01,
startRequestedTime=1408515526350, startedTime=1408515526350
15:59:00,750 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_ATTEMPT_STARTED, event=vertexName=null,
taskAttemptId=attempt_1408499461970_0035_1_00_000000_0,
startTime=1408515529190, containerId=container_1408499461970_0035_01_000002,
nodeId=192.168.3.4:50974, inProgressLogs=null, completedLogs=null
15:59:00,750 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_DATA_MOVEMENT_EVENTS_GENERATED,
event=vertexId=vertex_1408499461970_0035_1_00, eventCount=1
15:59:00,768 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null,
taskAttemptId=attempt_1408499461970_0035_1_00_000000_0, startTime=0,
finishTime=1408515534680, timeTaken=1408515534680, status=SUCCEEDED,
diagnostics=, counters=Counters: 16, File System Counters, FILE: BYTES_READ=32,
FILE: BYTES_WRITTEN=1042, FILE: READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE:
WRITE_OPS=0, org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100,
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50,
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006,
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0
15:59:00,772 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_FINISHED, event=vertexName=null,
taskId=task_1408499461970_0035_1_00_000000, startTime=0,
finishTime=1408515534688, timeTaken=1408515534688, status=SUCCEEDED,
successfulAttemptID=null, diagnostics=, counters=Counters: 16, File System
Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE: READ_OPS=0,
FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0,
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100,
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50,
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006,
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0
15:59:00,774 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_FINISHED, event=vertexName=map,
vertexId=vertex_1408499461970_0035_1_00, initRequestedTime=0, initedTime=0,
startRequestedTime=0, startedTime=0, finishTime=1408515534690,
timeTaken=1408515534690, status=SUCCEEDED, diagnostics=, counters=Counters: 16,
File System Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE:
READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0,
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100,
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50,
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006,
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, vertexStats=null
15:59:00,775 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_STARTED, event=vertexName=null,
taskId=task_1408499461970_0035_1_01_000000, scheduledTime=1408515534698,
launchTime=1408515534698
15:59:00,775 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_ATTEMPT_STARTED, event=vertexName=null,
taskAttemptId=attempt_1408499461970_0035_1_01_000000_0,
startTime=1408515534718, containerId=container_1408499461970_0035_01_000002,
nodeId=192.168.3.4:50974, inProgressLogs=null, completedLogs=null
15:59:00,777 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_FINISHED, event=vertexName=null,
taskId=task_1408499461970_0035_1_00_000000, startTime=0,
finishTime=1408515540309, timeTaken=1408515540309, status=SUCCEEDED,
successfulAttemptID=null, diagnostics=, counters=Counters: 16, File System
Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE: READ_OPS=0,
FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0,
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100,
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50,
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006,
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0
15:59:00,778 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_FINISHED, event=vertexName=map,
vertexId=vertex_1408499461970_0035_1_00, initRequestedTime=0, initedTime=0,
startRequestedTime=0, startedTime=0, finishTime=1408515540344,
timeTaken=1408515540344, status=SUCCEEDED, diagnostics=, counters=Counters: 16,
File System Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE:
READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0,
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100,
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50,
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006,
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0,
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, vertexStats=null
15:59:00,779 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_ATTEMPT_STARTED, event=vertexName=null,
taskAttemptId=attempt_1408499461970_0035_1_01_000000_1,
startTime=1408515542968, containerId=container_1408499461970_0035_02_000002,
nodeId=192.168.3.4:50974, inProgressLogs=null, completedLogs=null
15:59:00,801 - Thread( main) - (Limits.java:59) - Counter limits initialized
with parameters: GROUP_NAME_MAX=128, MAX_GROUPS=500, COUNTER_NAME_MAX=64,
MAX_COUNTERS=1200
15:59:00,802 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null,
taskAttemptId=attempt_1408499461970_0035_1_01_000000_1, startTime=0,
finishTime=1408515553528, timeTaken=1408515553528, status=SUCCEEDED,
diagnostics=, counters=Counters: 30, File System Counters, FILE:
BYTES_READ=1010, FILE: BYTES_WRITTEN=1010, FILE: READ_OPS=0, FILE:
LARGE_READ_OPS=0, FILE: WRITE_OPS=0,
org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=100,
REDUCE_INPUT_RECORDS=100, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=100,
NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0,
MERGED_MAP_OUTPUTS=1, GC_TIME_MILLIS=11, COMMITTED_HEAP_BYTES=257425408,
OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=1010,
ADDITIONAL_SPILLS_BYTES_READ=1010, SHUFFLE_BYTES=1010,
SHUFFLE_BYTES_DECOMPRESSED=1006, SHUFFLE_BYTES_TO_MEM=1010,
SHUFFLE_BYTES_TO_DISK=0, NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0,
Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0,
WRONG_MAP=0, WRONG_REDUCE=0
15:59:00,804 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=VERTEX_FINISHED, event=vertexName=reduce,
vertexId=vertex_1408499461970_0035_1_01, initRequestedTime=0, initedTime=0,
startRequestedTime=0, startedTime=0, finishTime=1408515553533,
timeTaken=1408515553533, status=SUCCEEDED, diagnostics=, counters=Counters: 30,
File System Counters, FILE: BYTES_READ=1010, FILE: BYTES_WRITTEN=1010, FILE:
READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0,
org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=100,
REDUCE_INPUT_RECORDS=100, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=100,
NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0,
MERGED_MAP_OUTPUTS=1, GC_TIME_MILLIS=11, COMMITTED_HEAP_BYTES=257425408,
OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=1010,
ADDITIONAL_SPILLS_BYTES_READ=1010, SHUFFLE_BYTES=1010,
SHUFFLE_BYTES_DECOMPRESSED=1006, SHUFFLE_BYTES_TO_MEM=1010,
SHUFFLE_BYTES_TO_DISK=0, NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0,
Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0,
WRONG_MAP=0, WRONG_REDUCE=0, vertexStats=null
15:59:00,805 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=DAG_COMMIT_STARTED,
event=dagID=dag_1408499461970_0035_1
15:59:00,810 - Thread( main) - (RecoveryParser.java:312) - Parsed event from
recovery stream, eventType=DAG_FINISHED, event=dagId=dag_1408499461970_0035_1,
startTime=0, finishTime=1408515553540, timeTaken=1408515553540,
status=SUCCEEDED, diagnostics=, counters=Counters: 36,
org.apache.tez.common.counters.DAGCounter, TOTAL_LAUNCHED_TASKS=3, File System
Counters, FILE: BYTES_READ=1042, FILE: BYTES_WRITTEN=2052, FILE: READ_OPS=0,
FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0,
org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=100,
REDUCE_INPUT_RECORDS=100, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=200,
NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0,
MERGED_MAP_OUTPUTS=1, GC_TIME_MILLIS=20, COMMITTED_HEAP_BYTES=514850816,
INPUT_RECORDS_PROCESSED=50, OUTPUT_RECORDS=100, OUTPUT_BYTES=800,
OUTPUT_BYTES_WITH_OVERHEAD=1006, OUTPUT_BYTES_PHYSICAL=1010,
ADDITIONAL_SPILLS_BYTES_WRITTEN=1010, ADDITIONAL_SPILLS_BYTES_READ=1010,
ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=1010, SHUFFLE_BYTES_DECOMPRESSED=1006,
SHUFFLE_BYTES_TO_MEM=1010, SHUFFLE_BYTES_TO_DISK=0, NUM_MEM_TO_DISK_MERGES=0,
NUM_DISK_TO_DISK_MERGES=0, Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0,
WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0
15:59:00,810 - Thread( main) - (RecoveryParser.java:309) - Reached end of stream
{code}
> Recovery fail due to TaskAttemptFinishedEvent is recorded multiple times for
> the same task
> ------------------------------------------------------------------------------------------
>
> Key: TEZ-1470
> URL: https://issues.apache.org/jira/browse/TEZ-1470
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Priority: Minor
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this
> case TaskAttemptFinishedEvent may been recorded 2 times,and will cause
> failure in recovery.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService:
> Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause:
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
> attempt finished, more completions than starts encountered,
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
> incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for
> attempt finished, more completions than starts encountered,
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2,
> incompleteAttempts=-1
> at
> org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
> at
> org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
> at
> org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
> at
> org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
> at
> org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
> at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)