[ 
https://issues.apache.org/jira/browse/SPARK-16752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Pran updated SPARK-16752:
-----------------------------
    Description: 
We are seeing a strange issue with Spark Job Server (SJS).

We are using SJS 0.6.1 and Spark 1.5.0 in "yarn-client" mode. The contents of 
settings.sh for SJS are as follows:

********************************************************************

INSTALL_DIR=$(cd `dirname $0`; pwd -P)
LOG_DIR=$INSTALL_DIR/logs
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=16G
SPARK_VERSION=1.5.0
SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.10.4

********************************************************************

We are using fair scheduling with 2 pools, with 50 executors of 1 GB each.

We also have max-jobs-per-context set to the number of cores, which is 48. A 
rough sketch of this configuration is shown below.
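
For reference, here is a minimal sketch of the kind of configuration we mean; 
the pool names and the file path are placeholders, not our exact values, and 
the exact placement of these keys in the SJS conf may differ.

********************************************************************

fairscheduler.xml (pool names are placeholders):

<allocations>
  <pool name="poolA">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
  <pool name="poolB">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>

Related Spark/SJS properties:

spark.scheduler.mode = FAIR
spark.scheduler.allocation.file = /path/to/fairscheduler.xml
spark.jobserver.max-jobs-per-context = 48

********************************************************************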

For the first 5 minutes or so everything is fine and the jobs get processed 
normally.

After that, we see these 2 issues happening at random:

1) The cluster is completely idle, with no jobs running, yet SJS accepts a 
request but does not submit it to the cluster for almost 3-4 minutes, and the 
job sits in the "running job" list for that long.

2) SJS accepts a request and submits it to the cluster, and the cluster 
finishes the job, but even then SJS does not move the job to the completed 
list; it keeps it in the "running job" list for 3-4 minutes before moving it 
to the completed job list, and during this time our application keeps waiting 
for the response.
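
To make the symptom concrete, this is roughly how we drive and observe SJS 
through its REST API (the appName, classPath, and context values below are 
placeholders; 8090 is the default SJS port):

********************************************************************

# Submit a job asynchronously (placeholder names)
curl -X POST "localhost:8090/jobs?appName=myApp&classPath=com.example.MyJob&context=myContext"

# Poll the job list; the job stays in status RUNNING for 3-4 minutes
# even when YARN shows the cluster idle (issue 1) or the job done (issue 2)
curl "localhost:8090/jobs"

# Or poll a single job by its id
curl "localhost:8090/jobs/<jobId>"

********************************************************************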

More issue details are documented at the external issue URL given below.

Detailed steps are outlined below.

#1 The 1st screenshot (SJS_JOBS_RUNNING) shows the running job list.

    Please look at the last row: the submit time for the last job id in the 
screenshot (4747ae86-7de3-4819-a29c-2b2c80c568a2) is "16:49:00".

#2  The 2nd screenshot (SJS_JOB_COMP_YARN), from the Spark YARN cluster, shows 
that the same job was already completed at "16:49:25".

#3  The 3rd screenshot (SJS_JOB_LOG_CONSOLE) is from the Spark Job Server log; 
it shows the same job completing at "17:13:55".

So SJS was holding onto the job for more than 24 minutes, keeping it in the 
running job list even though YARN had responded in time.
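
For clarity, the gap from the timestamps above works out as follows (GNU date 
is used here only to illustrate the arithmetic):

********************************************************************

# 17:13:55 (SJS log) minus 16:49:25 (YARN completion)
echo $(( $(date -d "17:13:55" +%s) - $(date -d "16:49:25" +%s) ))   # 1470 seconds = 24 min 30 s

********************************************************************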

Also, please take a look at the attached SJS log (SJS_Limited_Log.txt) for the 
time period around when this job was submitted.



> Spark Job Server not releasing jobs from "running list" even after YARN 
> completes the job
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-16752
>                 URL: https://issues.apache.org/jira/browse/SPARK-16752
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.6.0, 1.5.0
>         Environment: SJS version 0.6.1 and Spark 1.5.0 running on Yarn-client 
> mode
>            Reporter: Ash Pran
>              Labels: patch
>         Attachments: SJS_JOBS_RUNNING, SJS_JOB_COMP_YARN, 
> SJS_JOB_LOG_CONSOLE, SJS_Limited_Log.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
