Jaydeep Vishwakarma created OOZIE-2314:
------------------------------------------

             Summary: Unable to kill old instance child job by workflow or coord rerun by Launcher
                 Key: OOZIE-2314
                 URL: https://issues.apache.org/jira/browse/OOZIE-2314
             Project: Oozie
          Issue Type: Bug
            Reporter: Jaydeep Vishwakarma


The Oozie launcher kills all child jobs that were launched by an old instance 
of the same launcher, workflow, or coord action, so that duplicate child jobs 
do not run at the same time. To do this, it searches for application ids by 
tag and time, and kills all matching AMs. You can find more detail in 
OOZIE-2129. 
This works fine when a Launcher attempt gets killed and is retried. It does 
not work when the Yarn container that holds the AM gets killed for some reason 
and we then rerun the workflow/coord action.
   This happens because of the time filter applied while finding the app ids, 
which always takes the current time from the server.
   {{LauncherMapperHelper.java}}
   {code}
       public static void setupYarnRestartHandling(JobConf launcherJobConf,
               Configuration actionConf, String launcherTag)
               throws NoSuchAlgorithmException {
           launcherJobConf.setLong(LauncherMainHadoopUtils.OOZIE_JOB_LAUNCH_TIME,
                   System.currentTimeMillis());
           // Tags are limited to 100 chars so we need to hash them to make sure
           // (the actionId otherwise doesn't have a max length)
           String tag = getTag(launcherTag);
           // keeping the oozie.child.mapreduce.job.tags instead of
           // mapreduce.job.tags to avoid killing launcher itself.
           // mapreduce.job.tags should only go to child job launch by launcher.
           actionConf.set(LauncherMainHadoopUtils.CHILD_MAPREDUCE_JOB_TAGS, tag);
       }
   {code}
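
The comment in the snippet above says tags are hashed to stay under the 100 character limit. The body of getTag is not shown in this report, so the following is only a rough sketch of that idea. The class name, the method name shortTag, the "oozie-" prefix and the choice of MD5 are all assumptions made for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TagHashSketch {

    // Hypothetical helper, NOT the real getTag: maps an action id of any
    // length to a fixed-length tag so it always fits the 100 char limit.
    static String shortTag(String launcherTag) {
        try {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            byte[] hash = digest.digest(launcherTag.getBytes(StandardCharsets.UTF_8));
            // 16 hash bytes rendered as 32 zero-padded hex characters.
            String hex = String.format("%032x", new BigInteger(1, hash));
            // The prefix is an assumption made for this sketch.
            return "oozie-" + hex;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String tag = shortTag("some-very-long-workflow-action-id");
        System.out.println(tag + " (" + tag.length() + " chars)");
    }
}
```

Because the tag is derived only from the action id, a rerun of the same action gets the same tag as the old instance. In a scheme like this, the tag alone would still match the old child jobs.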

When a user reruns the workflow or coord action, the Launcher picks up the 
current system time along with the tags, then searches for running application 
ids to kill. It does not find any app id, because the previous instance of the 
same workflow/coord started before the new system time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)