[
https://issues.apache.org/jira/browse/OOZIE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Péter Gergő Barna updated OOZIE-2847:
-------------------------------------
Attachment: OOZIE-2847.patch
> Oozie HA timing issue
> ---------------------
>
> Key: OOZIE-2847
> URL: https://issues.apache.org/jira/browse/OOZIE-2847
> Project: Oozie
> Issue Type: Bug
> Components: HA
> Affects Versions: 4.3.0
> Reporter: Péter Gergő Barna
> Priority: Minor
> Attachments: OOZIE-2847.patch
>
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> Oozie HA timing issue
> When Oozie launches the mapper, it writes the job id to a file on HDFS. If
> the ApplicationMaster is killed, Oozie makes a second attempt during
> recovery. On that attempt, Oozie checks whether the job id previously
> written to HDFS matches the current job id. In most cases the two match.
> However, if the Oozie launcher is killed while it is in the middle of
> writing the id, the file has already been created on HDFS but the id has not
> yet been written to it. During the next recovery, Oozie mistakenly assumes
> the file contains an id although the file is actually empty, and therefore
> throws this exception:
> {{monospaced}}
> 2015-07-10 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------
> 2015-07-10 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|Console URL : http://dal-ha21:8088/proxy/application_1436507526035_0001/
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error Code : JA018
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error Message : Hadoop job Id mismatch, action file [hdfs://hdp2-ha2/user/hadoopqa/oozie-hado/0000003-150710041341636-oozie-hado-W/pig-node--pig/0000003-150710041341636-oozie-hado-W@pig-node@0] declares Id [null] current Id [job_1436507526035_0001]
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External ID : job_1436507526035_0001
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External Status : FAILED/KILLED
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Name : pig-node
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Retries : 0
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Tracker URI : dal-ha21:8032
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Type : pig
> 2015-07-10 05:56:58,158|beaver.machine|INFO|5208|1344|MainThread|Started : 2015-07-10 05:55:19 GMT
> 2015-07-10 05:56:58,160|beaver.machine|INFO|5208|1344|MainThread|Status : ERROR
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|Ended : 2015-07-10 05:56:42 GMT
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External Stats : null
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External ChildIDs : null
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------
> Exception:
> 2015-07-10 05:56:18,658 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://hdp2-ha2:8020]
> 2015-07-10 05:56:18,665 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist
> 2015-07-10 05:56:18,693 WARN [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Unable to parse prior job history, aborting recovery
> java.io.IOException: Incompatible event log version: null
>     at org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:71)
>     at org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:139)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.parsePreviousJobHistory(MRAppMaster.java:1206)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.processRecovery(MRAppMaster.java:1175)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1039)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1519)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1515)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1448)
> 2015-07-10 05:56:18,737 INFO [main] org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file system [hdfs://hdp2-ha2:8020]
> 2015-07-10 05:56:18,745 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist
> {{monospaced}}
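
The race described above (file created, id not yet written, launcher killed) can be illustrated with a small standalone sketch. This is not the attached OOZIE-2847.patch and does not use Oozie or HDFS APIs: the class name ActionIdFile and its methods are hypothetical, and java.nio.file on a local file system stands in for HDFS. It shows two possible mitigations under those assumptions: writing the id to a temporary file and renaming it into place, and treating an empty id file during recovery as "no id recorded" rather than as a mismatch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of Oozie.
final class ActionIdFile {

    // Write the id to a sibling temp file, then rename it into place, so a
    // reader never observes a file that exists but has no id in it yet.
    static void writeIdAtomically(Path idFile, String jobId) throws IOException {
        Path tmp = idFile.resolveSibling(idFile.getFileName() + ".tmp");
        Files.write(tmp, jobId.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, idFile, StandardCopyOption.ATOMIC_MOVE);
    }

    // Return the recorded id, or empty if the file is missing or blank
    // (the state left behind when the writer was killed mid-write).
    static Optional<String> readRecordedId(Path idFile) throws IOException {
        if (!Files.exists(idFile)) {
            return Optional.empty();
        }
        String content = new String(Files.readAllBytes(idFile), StandardCharsets.UTF_8).trim();
        return content.isEmpty() ? Optional.empty() : Optional.of(content);
    }

    // A mismatch is reported only when an id was actually recorded and it
    // differs from the current one; an empty file is not a mismatch.
    static boolean isIdMismatch(Path idFile, String currentId) throws IOException {
        Optional<String> recorded = readRecordedId(idFile);
        return recorded.isPresent() && !recorded.get().equals(currentId);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("action-id-demo");
        Path idFile = dir.resolve("action-id");

        // Simulate the interrupted first attempt: file created, nothing written.
        Files.createFile(idFile);

        String currentId = "job_1436507526035_0001";
        if (!readRecordedId(idFile).isPresent()) {
            // Recovery path: no usable id on disk, so record the current one.
            writeIdAtomically(idFile, currentId);
        }
        System.out.println("mismatch: " + isIdMismatch(idFile, currentId));
    }
}
```

Whether recovery should rewrite the id or fail in a different way when the file is empty is a policy choice; the sketch picks the former only to show that an empty file can be told apart from a genuinely different id.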
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)