[ https://issues.apache.org/jira/browse/OOZIE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denes Bodo reassigned OOZIE-2847: --------------------------------- Assignee: Denes Bodo > Oozie Ha timing issue > --------------------- > > Key: OOZIE-2847 > URL: https://issues.apache.org/jira/browse/OOZIE-2847 > Project: Oozie > Issue Type: Bug > Components: HA > Affects Versions: 4.3.0 > Reporter: Péter Gergő Barna > Assignee: Denes Bodo > Priority: Minor > Attachments: OOZIE-2847.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Oozie Ha timing issue > When Oozie is launching the mapper, it is writing a job id into a file on > hdfs. Let's assume the ApplicationMaster is killed, and Oozie will make a > second try, during recovery. On the second try, Oozie is trying to see if the > previously written job id on hdfs matches the current job id. In most > occasion, this will match. However, in the event when Oozie launcher is > killed right in the middle when Oozie is in the process of writing id in the > file, the Oozie file in hdfs is created, but the id has yet to be written to > the file. During the next recovery, Oozie will mistakenly think the id exists > in the file while the file is actually empty, therefore throwing this > exception: > {noformat} > 2015-07-10 > 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------ > 2015-07-10 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|Console URL > : http://dal-ha21:8088/proxy/application_1436507526035_0001/ > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error Code > : JA018 > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error > Message : Hadoop job Id mismatch, action file > [hdfs://hdp2-ha2/user/hadoopqa/oozie-hado/0000003-150710041341636-oozie-hado-W/pig-node--pig/0000003-150710041341636-oozie-hado-W@pig-node@0] > declares Id [null] current Id [job_1436507526035_0001] > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External ID > : job_1436507526035_0001 > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External > Status : FAILED/KILLED > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Name > : pig-node > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Retries > : 0 > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Tracker URI > : dal-ha21:8032 > 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Type > : pig > 2015-07-10 05:56:58,158|beaver.machine|INFO|5208|1344|MainThread|Started > : 2015-07-10 05:55:19 GMT > 2015-07-10 05:56:58,160|beaver.machine|INFO|5208|1344|MainThread|Status > : ERROR > 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|Ended > : 2015-07-10 05:56:42 GMT > 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External > Stats : null > 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External > ChildIDs : null > 2015-07-10 > 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------ > Exception: > 2015-07-10 05:56:18,658 INFO [main] > org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file > system [hdfs://hdp2-ha2:8020] > 2015-07-10 05:56:18,665 INFO [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at > hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist > 2015-07-10 05:56:18,693 WARN [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Unable to parse prior job > history, aborting recovery > java.io.IOException: Incompatible event log version: null > at > org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:71) > at > org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:139) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.parsePreviousJobHistory(MRAppMaster.java:1206) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.processRecovery(MRAppMaster.java:1175) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1039) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1519) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1515) > at > org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1448) > 2015-07-10 05:56:18,737 INFO [main] > org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file > system [hdfs://hdp2-ha2:8020] > 2015-07-10 05:56:18,745 INFO [main] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at > hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)