spark: active container remains after failed job

Diana Carroll (JIRA) Fri, 07 Aug 2015 06:39:12 -0700

Diana Carroll created OOZIE-2326:
------------------------------------

             Summary: oozie/yarn/spark: active container remains after failed job
                 Key: OOZIE-2326
                 URL: https://issues.apache.org/jira/browse/OOZIE-2326
             Project: Oozie
          Issue Type: Bug
          Components: workflow
    Affects Versions: 4.1.0
         Environment: pseudo-distributed (single VM), CentOS 6.6, CDH 5.4.3
            Reporter: Diana Carroll



The issue occurs when I launch a Spark job (local mode) that fails.  (My example 
failed because I tried to read a non-existent file.)  When this occurs, the job 
fails and YARN ends up in an inconsistent state: the ResourceManager (RM) shows 
that the launcher job has completed, but a container for the job is still live 
on the slave node.  Because I'm running in pseudo-distributed mode, this 
completely hangs my cluster: no other jobs can run because there are only 
resources for a single container, and that container is held by the dead Oozie 
launcher.

If I wait long enough, YARN will eventually time out and release the container 
and start accepting new jobs.  But until then I'm dead in the water.

Attaching screenshots that show the state right after running the failed job:
- the RM shows no jobs running
- the node shows one container running
Also attaching a log file for the Oozie job and the container.
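
For anyone who wants to check for the same mismatch without the web UI, here 
is a rough sketch (not part of the original report) that compares the RM's 
list of running applications with the containers the NodeManager still 
reports, using the YARN REST APIs.  The host name and ports (localhost, 8088 
for the RM, 8042 for the NodeManager) are assumed defaults for a 
pseudo-distributed setup, and the helper names are made up for illustration.

```python
import json
import re
from urllib.request import urlopen

# Assumed default web endpoints for a single-node setup; adjust as needed.
RM_APPS_URL = "http://localhost:8088/ws/v1/cluster/apps?states=RUNNING"
NM_CONTAINERS_URL = "http://localhost:8042/ws/v1/node/containers"

# container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$")


def fetch_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def running_app_ids(rm_payload):
    """Application IDs the RM considers running ("apps" is null when none)."""
    apps = (rm_payload.get("apps") or {}).get("app") or []
    return {app["id"] for app in apps}


def app_id_of(container_id):
    """Derive the owning application ID from a container ID, or None."""
    match = CONTAINER_ID.match(container_id)
    if match is None:
        return None
    return "application_{}_{}".format(match.group(1), match.group(2))


def leftover_containers(rm_payload, nm_payload):
    """Containers on the node whose application the RM no longer lists as
    running.  Containers with an unrecognized ID format are also reported."""
    live = running_app_ids(rm_payload)
    containers = (nm_payload.get("containers") or {}).get("container") or []
    return [c["id"] for c in containers if app_id_of(c["id"]) not in live]


def report(rm_url=RM_APPS_URL, nm_url=NM_CONTAINERS_URL):
    """Query both daemons and print any leftover container IDs."""
    for container_id in leftover_containers(fetch_json(rm_url),
                                            fetch_json(nm_url)):
        print(container_id)
```

Calling report() right after the failed job should, if the state matches what 
the screenshots show, print the ID of the container that is still held by the 
launcher.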




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
