[ 
https://issues.apache.org/jira/browse/OOZIE-2847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352507#comment-16352507
 ] 

Denes Bodo commented on OOZIE-2847:
-----------------------------------

Patch for 4.3 seems straightforward, but testing in 5.0 is a bit complicated. 
Will provide patches next day.

> Oozie Ha timing issue
> ---------------------
>
>                 Key: OOZIE-2847
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2847
>             Project: Oozie
>          Issue Type: Bug
>          Components: HA
>    Affects Versions: 4.3.0
>            Reporter: Péter Gergő Barna
>            Assignee: Denes Bodo
>            Priority: Minor
>         Attachments: OOZIE-2847.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Oozie Ha timing issue
> When Oozie is launching the mapper, it is writing a job id into a file on 
> hdfs. Let's assume the ApplicationMaster is killed, and Oozie will make a 
> second try, during recovery. On the second try, Oozie is trying to see if the 
> previously written job id on hdfs matches the current job id. In most 
> occasion, this will match. However, in the event when Oozie launcher is 
> killed right in the middle when Oozie is in the process of writing id in the 
> file, the Oozie file in hdfs is created, but the id has yet to be written to 
> the file. During the next recovery, Oozie will mistakenly think the id exists 
> in the file while the file is actually empty, therefore throwing this 
> exception: 
> {noformat}
> 2015-07-10 
> 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------
> 2015-07-10 05:56:58,137|beaver.machine|INFO|5208|1344|MainThread|Console URL  
>      : http://dal-ha21:8088/proxy/application_1436507526035_0001/
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error Code   
>      : JA018
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Error 
> Message     : Hadoop job Id mismatch, action file 
> [hdfs://hdp2-ha2/user/hadoopqa/oozie-hado/0000003-150710041341636-oozie-hado-W/pig-node--pig/0000003-150710041341636-oozie-hado-W@pig-node@0]
>  declares Id [null] current Id [job_1436507526035_0001]
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External ID  
>      : job_1436507526035_0001
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|External 
> Status   : FAILED/KILLED
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Name         
>      : pig-node
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Retries      
>      : 0
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Tracker URI  
>      : dal-ha21:8032
> 2015-07-10 05:56:58,138|beaver.machine|INFO|5208|1344|MainThread|Type         
>      : pig
> 2015-07-10 05:56:58,158|beaver.machine|INFO|5208|1344|MainThread|Started      
>      : 2015-07-10 05:55:19 GMT
> 2015-07-10 05:56:58,160|beaver.machine|INFO|5208|1344|MainThread|Status       
>      : ERROR
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|Ended        
>      : 2015-07-10 05:56:42 GMT
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External 
> Stats    : null
> 2015-07-10 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|External 
> ChildIDs : null
> 2015-07-10 
> 05:56:58,161|beaver.machine|INFO|5208|1344|MainThread|------------------------------------------------------------------------------------------------------------------------------------
> Exception:
> 2015-07-10 05:56:18,658 INFO [main] 
> org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file 
> system [hdfs://hdp2-ha2:8020]
> 2015-07-10 05:56:18,665 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at 
> hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist
> 2015-07-10 05:56:18,693 WARN [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Unable to parse prior job 
> history, aborting recovery
> java.io.IOException: Incompatible event log version: null
>       at 
> org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:71)
>       at 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:139)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.parsePreviousJobHistory(MRAppMaster.java:1206)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.processRecovery(MRAppMaster.java:1175)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1039)
>       at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1519)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1515)
>       at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1448)
> 2015-07-10 05:56:18,737 INFO [main] 
> org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils: Default file 
> system [hdfs://hdp2-ha2:8020]
> 2015-07-10 05:56:18,745 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Previous history file is at 
> hdfs://hdp2-ha2:8020/user/hadoopqa/.staging/job_1436507526035_0001/job_1436507526035_0001_1.jhist
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to