[ 
https://issues.apache.org/jira/browse/TEZ-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1470:
----------------------------
    Description: 
TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this case 
TaskAttemptFinishedEvent may been recorded 2 times,and will cause failure in 
recovery.

{code}
14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: 
Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: 
org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
attempt finished, more completions than starts encountered, 
taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
incompleteAttempts=-1
org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
attempt finished, more completions than starts encountered, 
taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
incompleteAttempts=-1
        at 
org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
        at 
org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
        at 
org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
        at 
org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
        at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
        at 
org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
        at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
{code}



  was:
When AM is in attempt 2, sometimes TaskFinishedEvent may be recorded multiple 
times, although it won't affect the recovery behavior now, track the issue 
here. 

Here's one example ( task_1408499461970_0035_1_00_000000 is finished 2 times 
here, one is in the attempt 1 and the other is in the attempt 2 )
{code}
15:59:00,671 - Thread( main) - (RecoveryParser.java:332) - Parsing DAG recovery 
file 
hdfs://0.0.0.0:9000/user/jzhang/tez/staging/.tez/application_1408499461970_0035/recovery/2/dag_1408499461970_0035_1.recovery
15:59:00,716 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=DAG_SUBMITTED, event=dagID=dag_1408499461970_0035_1, 
submitTime=1408515526009
15:59:00,718 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=DAG_INITIALIZED, 
event=dagID=dag_1408499461970_0035_1, initTime=1408515526187
15:59:00,719 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=DAG_STARTED, event=dagID=dag_1408499461970_0035_1, 
startTime=1408515526263
15:59:00,722 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_INITIALIZED, event=vertexName=reduce, 
vertexId=vertex_1408499461970_0035_1_01, initRequestedTime=1408515526282, 
initedTime=1408515526325, numTasks=1, processorName=null, 
additionalInputsCount=0
15:59:00,725 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_INITIALIZED, event=vertexName=map, 
vertexId=vertex_1408499461970_0035_1_00, initRequestedTime=1408515526265, 
initedTime=1408515526331, numTasks=1, processorName=null, 
additionalInputsCount=1
15:59:00,736 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_DATA_MOVEMENT_EVENTS_GENERATED, 
event=vertexId=vertex_1408499461970_0035_1_00, eventCount=1
15:59:00,738 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_DATA_MOVEMENT_EVENTS_GENERATED, 
event=vertexId=vertex_1408499461970_0035_1_00, eventCount=1
15:59:00,739 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_STARTED, 
event=vertexId=vertex_1408499461970_0035_1_00, 
startRequestedTime=1408515526332, startedTime=1408515526332
15:59:00,742 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_STARTED, event=vertexName=null, 
taskId=task_1408499461970_0035_1_00_000000, scheduledTime=1408515526345, 
launchTime=1408515526345
15:59:00,742 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_STARTED, 
event=vertexId=vertex_1408499461970_0035_1_01, 
startRequestedTime=1408515526350, startedTime=1408515526350
15:59:00,750 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_ATTEMPT_STARTED, event=vertexName=null, 
taskAttemptId=attempt_1408499461970_0035_1_00_000000_0, 
startTime=1408515529190, containerId=container_1408499461970_0035_01_000002, 
nodeId=192.168.3.4:50974, inProgressLogs=null, completedLogs=null
15:59:00,750 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_DATA_MOVEMENT_EVENTS_GENERATED, 
event=vertexId=vertex_1408499461970_0035_1_00, eventCount=1
15:59:00,768 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
taskAttemptId=attempt_1408499461970_0035_1_00_000000_0, startTime=0, 
finishTime=1408515534680, timeTaken=1408515534680, status=SUCCEEDED, 
diagnostics=, counters=Counters: 16, File System Counters, FILE: BYTES_READ=32, 
FILE: BYTES_WRITTEN=1042, FILE: READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE: 
WRITE_OPS=0, org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100, 
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50, 
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006, 
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0
15:59:00,772 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_FINISHED, event=vertexName=null, 
taskId=task_1408499461970_0035_1_00_000000, startTime=0, 
finishTime=1408515534688, timeTaken=1408515534688, status=SUCCEEDED, 
successfulAttemptID=null, diagnostics=, counters=Counters: 16, File System 
Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE: READ_OPS=0, 
FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0, 
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100, 
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50, 
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006, 
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0
15:59:00,774 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_FINISHED, event=vertexName=map, 
vertexId=vertex_1408499461970_0035_1_00, initRequestedTime=0, initedTime=0, 
startRequestedTime=0, startedTime=0, finishTime=1408515534690, 
timeTaken=1408515534690, status=SUCCEEDED, diagnostics=, counters=Counters: 16, 
File System Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE: 
READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0, 
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100, 
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50, 
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006, 
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, vertexStats=null
15:59:00,775 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_STARTED, event=vertexName=null, 
taskId=task_1408499461970_0035_1_01_000000, scheduledTime=1408515534698, 
launchTime=1408515534698
15:59:00,775 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_ATTEMPT_STARTED, event=vertexName=null, 
taskAttemptId=attempt_1408499461970_0035_1_01_000000_0, 
startTime=1408515534718, containerId=container_1408499461970_0035_01_000002, 
nodeId=192.168.3.4:50974, inProgressLogs=null, completedLogs=null
15:59:00,777 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_FINISHED, event=vertexName=null, 
taskId=task_1408499461970_0035_1_00_000000, startTime=0, 
finishTime=1408515540309, timeTaken=1408515540309, status=SUCCEEDED, 
successfulAttemptID=null, diagnostics=, counters=Counters: 16, File System 
Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE: READ_OPS=0, 
FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0, 
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100, 
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50, 
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006, 
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0
15:59:00,778 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_FINISHED, event=vertexName=map, 
vertexId=vertex_1408499461970_0035_1_00, initRequestedTime=0, initedTime=0, 
startRequestedTime=0, startedTime=0, finishTime=1408515540344, 
timeTaken=1408515540344, status=SUCCEEDED, diagnostics=, counters=Counters: 16, 
File System Counters, FILE: BYTES_READ=32, FILE: BYTES_WRITTEN=1042, FILE: 
READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0, 
org.apache.tez.common.counters.TaskCounter, SPILLED_RECORDS=100, 
GC_TIME_MILLIS=9, COMMITTED_HEAP_BYTES=257425408, INPUT_RECORDS_PROCESSED=50, 
OUTPUT_RECORDS=100, OUTPUT_BYTES=800, OUTPUT_BYTES_WITH_OVERHEAD=1006, 
OUTPUT_BYTES_PHYSICAL=1010, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, vertexStats=null
15:59:00,779 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_ATTEMPT_STARTED, event=vertexName=null, 
taskAttemptId=attempt_1408499461970_0035_1_01_000000_1, 
startTime=1408515542968, containerId=container_1408499461970_0035_02_000002, 
nodeId=192.168.3.4:50974, inProgressLogs=null, completedLogs=null
15:59:00,801 - Thread( main) - (Limits.java:59) - Counter limits initialized 
with parameters:  GROUP_NAME_MAX=128, MAX_GROUPS=500, COUNTER_NAME_MAX=64, 
MAX_COUNTERS=1200
15:59:00,802 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=TASK_ATTEMPT_FINISHED, event=vertexName=null, 
taskAttemptId=attempt_1408499461970_0035_1_01_000000_1, startTime=0, 
finishTime=1408515553528, timeTaken=1408515553528, status=SUCCEEDED, 
diagnostics=, counters=Counters: 30, File System Counters, FILE: 
BYTES_READ=1010, FILE: BYTES_WRITTEN=1010, FILE: READ_OPS=0, FILE: 
LARGE_READ_OPS=0, FILE: WRITE_OPS=0, 
org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=100, 
REDUCE_INPUT_RECORDS=100, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=100, 
NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0, 
MERGED_MAP_OUTPUTS=1, GC_TIME_MILLIS=11, COMMITTED_HEAP_BYTES=257425408, 
OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=1010, 
ADDITIONAL_SPILLS_BYTES_READ=1010, SHUFFLE_BYTES=1010, 
SHUFFLE_BYTES_DECOMPRESSED=1006, SHUFFLE_BYTES_TO_MEM=1010, 
SHUFFLE_BYTES_TO_DISK=0, NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, 
Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, 
WRONG_MAP=0, WRONG_REDUCE=0
15:59:00,804 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=VERTEX_FINISHED, event=vertexName=reduce, 
vertexId=vertex_1408499461970_0035_1_01, initRequestedTime=0, initedTime=0, 
startRequestedTime=0, startedTime=0, finishTime=1408515553533, 
timeTaken=1408515553533, status=SUCCEEDED, diagnostics=, counters=Counters: 30, 
File System Counters, FILE: BYTES_READ=1010, FILE: BYTES_WRITTEN=1010, FILE: 
READ_OPS=0, FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0, 
org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=100, 
REDUCE_INPUT_RECORDS=100, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=100, 
NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0, 
MERGED_MAP_OUTPUTS=1, GC_TIME_MILLIS=11, COMMITTED_HEAP_BYTES=257425408, 
OUTPUT_RECORDS=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=1010, 
ADDITIONAL_SPILLS_BYTES_READ=1010, SHUFFLE_BYTES=1010, 
SHUFFLE_BYTES_DECOMPRESSED=1006, SHUFFLE_BYTES_TO_MEM=1010, 
SHUFFLE_BYTES_TO_DISK=0, NUM_MEM_TO_DISK_MERGES=0, NUM_DISK_TO_DISK_MERGES=0, 
Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0, WRONG_LENGTH=0, 
WRONG_MAP=0, WRONG_REDUCE=0, vertexStats=null
15:59:00,805 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=DAG_COMMIT_STARTED, 
event=dagID=dag_1408499461970_0035_1
15:59:00,810 - Thread( main) - (RecoveryParser.java:312) - Parsed event from 
recovery stream, eventType=DAG_FINISHED, event=dagId=dag_1408499461970_0035_1, 
startTime=0, finishTime=1408515553540, timeTaken=1408515553540, 
status=SUCCEEDED, diagnostics=, counters=Counters: 36, 
org.apache.tez.common.counters.DAGCounter, TOTAL_LAUNCHED_TASKS=3, File System 
Counters, FILE: BYTES_READ=1042, FILE: BYTES_WRITTEN=2052, FILE: READ_OPS=0, 
FILE: LARGE_READ_OPS=0, FILE: WRITE_OPS=0, 
org.apache.tez.common.counters.TaskCounter, REDUCE_INPUT_GROUPS=100, 
REDUCE_INPUT_RECORDS=100, COMBINE_INPUT_RECORDS=0, SPILLED_RECORDS=200, 
NUM_SHUFFLED_INPUTS=1, NUM_SKIPPED_INPUTS=0, NUM_FAILED_SHUFFLE_INPUTS=0, 
MERGED_MAP_OUTPUTS=1, GC_TIME_MILLIS=20, COMMITTED_HEAP_BYTES=514850816, 
INPUT_RECORDS_PROCESSED=50, OUTPUT_RECORDS=100, OUTPUT_BYTES=800, 
OUTPUT_BYTES_WITH_OVERHEAD=1006, OUTPUT_BYTES_PHYSICAL=1010, 
ADDITIONAL_SPILLS_BYTES_WRITTEN=1010, ADDITIONAL_SPILLS_BYTES_READ=1010, 
ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=1010, SHUFFLE_BYTES_DECOMPRESSED=1006, 
SHUFFLE_BYTES_TO_MEM=1010, SHUFFLE_BYTES_TO_DISK=0, NUM_MEM_TO_DISK_MERGES=0, 
NUM_DISK_TO_DISK_MERGES=0, Shuffle Errors, BAD_ID=0, CONNECTION=0, IO_ERROR=0, 
WRONG_LENGTH=0, WRONG_MAP=0, WRONG_REDUCE=0
15:59:00,810 - Thread( main) - (RecoveryParser.java:309) - Reached end of stream
{code}


> Recovery fail due to TaskAttemptFinishedEvent is recorded multiple times for 
> the same task
> ------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1470
>                 URL: https://issues.apache.org/jira/browse/TEZ-1470
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Minor
>
> TaskAttempt can move from SUCCEEDED to KILLED due to node failure. In this 
> case TaskAttemptFinishedEvent may been recorded 2 times,and will cause 
> failure in recovery.
> {code}
> 14-05-16 08:07:18,386 INFO [main] org.apache.hadoop.service.AbstractService: 
> Service org.apache.tez.dag.app.DAGAppMaster failed in state STARTED; cause: 
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
> org.apache.tez.dag.api.TezUncheckedException: Invalid recovery event for 
> attempt finished, more completions than starts encountered, 
> taskId=task_1400226928057_0001_1_05_000005, finishedAttempts=2, 
> incompleteAttempts=-1
>       at 
> org.apache.tez.dag.app.dag.impl.TaskImpl.restoreFromEvent(TaskImpl.java:592)
>       at 
> org.apache.tez.dag.app.RecoveryParser.parseRecoveryData(RecoveryParser.java:814)
>       at 
> org.apache.tez.dag.app.DAGAppMaster.recoverDAG(DAGAppMaster.java:1529)
>       at 
> org.apache.tez.dag.app.DAGAppMaster.serviceStart(DAGAppMaster.java:1558)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at org.apache.tez.dag.app.DAGAppMaster$5.run(DAGAppMaster.java:1957)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
>       at 
> org.apache.tez.dag.app.DAGAppMaster.initAndStartAppMaster(DAGAppMaster.java:1953)
>       at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1792)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to