[ https://issues.apache.org/jira/browse/SPARK-23943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

paul mackles updated SPARK-23943:
---------------------------------
    Description: 
Two changes in this PR:
 * A /health endpoint for a quick, binary indication of the health of 
MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
Marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
Returns a 503 status if the server is unhealthy and a 200 if it is healthy.
 * A /status endpoint for a more detailed examination of the current state of a 
MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool.

For both endpoints, regardless of status code, the following body is returned:

 
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "iamok",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true,
  "pendingRetryDrivers" : 0
}{code}
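As an illustration of how the /status endpoint might be consumed by an external 
monitor (not part of this PR; the dispatcher host and port below are 
hypothetical):
{code:python}
import json
import sys
from urllib.error import HTTPError
from urllib.request import urlopen

# Hypothetical address; substitute the host/port that your
# MesosClusterDispatcher REST server is actually bound to.
STATUS_URL = "http://dispatcher.example.com:8081/status"

def fetch_status(url=STATUS_URL):
    # Both endpoints return the ServerStatusResponse body regardless of the
    # HTTP status code, so read the body even when the server answers 503.
    try:
        with urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except HTTPError as err:
        return json.load(err)

if __name__ == "__main__":
    status = fetch_status()
    if status.get("schedulerDriverStopped") or not status.get("success", False):
        print("dispatcher unhealthy: %s" % status.get("message"), file=sys.stderr)
        sys.exit(1)
    print("dispatcher OK: queued=%d launched=%d pendingRetry=%d" % (
        status.get("queuedDrivers", 0),
        status.get("launchedDrivers", 0),
        status.get("pendingRetryDrivers", 0),
    ))
{code}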
Aside from surfacing all of the scheduler metrics, the response also includes 
the status of the Mesos SchedulerDriver. On numerous occasions, we have 
observed the Mesos SchedulerDriver quietly exit due to some other failure. 
When this happens, jobs queue up and the only way to clean things up is to 
restart the service.

With the above health check, Marathon can be configured to automatically 
restart the MesosClusterDispatcher service when the health check fails, 
lessening the need for manual intervention.
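For reference, a Marathon healthCheck stanza along these lines could be used; 
this is a sketch only, and the port index, intervals, and grace period are 
illustrative and depend on how the dispatcher app is defined:
{code:java}
"healthChecks": [
  {
    "protocol": "HTTP",
    "path": "/health",
    "portIndex": 0,
    "gracePeriodSeconds": 300,
    "intervalSeconds": 60,
    "timeoutSeconds": 20,
    "maxConsecutiveFailures": 3
  }
]
{code}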

  was:
Two changes:

First, a more robust 
[health check|http://mesosphere.github.io/marathon/docs/health-checks.html] 
for anyone who runs MesosClusterDispatcher as a Marathon app. Specifically, 
this check verifies that the MesosSchedulerDriver is still running, as we have 
seen certain cases where it stops (rather quietly) and the only way to revive 
it is a restart. With this health check, Marathon will restart the dispatcher 
if the MesosSchedulerDriver stops running. The health check lives at the URL 
"/health" and returns a 204 when the server is healthy and a 503 when it is not 
(e.g. the MesosSchedulerDriver stopped running).

Second, a server status endpoint that replies with some basic metrics about the 
server. The status endpoint resides at the URL "/status" and responds with:
{code:java}
{
  "action" : "ServerStatusResponse",
  "launchedDrivers" : 0,
  "message" : "server OK",
  "queuedDrivers" : 0,
  "schedulerDriverStopped" : false,
  "serverSparkVersion" : "2.3.1-SNAPSHOT",
  "success" : true
}{code}
As you can see, it includes a snapshot of the metrics/health of the scheduler. 
Useful for quick debugging/troubleshooting/monitoring. 


> Improve observability of MesosRestServer/MesosClusterDispatcher
> ---------------------------------------------------------------
>
>                 Key: SPARK-23943
>                 URL: https://issues.apache.org/jira/browse/SPARK-23943
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, Mesos
>    Affects Versions: 2.2.1, 2.3.0
>         Environment:  
>  
>            Reporter: paul mackles
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> Two changes in this PR:
>  * A /health endpoint for a quick, binary indication of the health of 
> MesosClusterDispatcher. Useful for those running MesosClusterDispatcher as a 
> Marathon app: [http://mesosphere.github.io/marathon/docs/health-checks.html]. 
> Returns a 503 status if the server is unhealthy and a 200 if it is healthy.
>  * A /status endpoint for a more detailed examination of the current state of 
> a MesosClusterDispatcher instance. Useful as a troubleshooting/monitoring tool.
> For both endpoints, regardless of status code, the following body is returned:
>  
> {code:java}
> {
>   "action" : "ServerStatusResponse",
>   "launchedDrivers" : 0,
>   "message" : "iamok",
>   "queuedDrivers" : 0,
>   "schedulerDriverStopped" : false,
>   "serverSparkVersion" : "2.3.1-SNAPSHOT",
>   "success" : true,
>   "pendingRetryDrivers" : 0
> }{code}
> Aside from surfacing all of the scheduler metrics, the response also includes 
> the status of the Mesos SchedulerDriver. On numerous occasions, we have 
> observed the Mesos SchedulerDriver quietly exit due to some other failure. 
> When this happens, jobs queue up and the only way to clean things up is to 
> restart the service. 
> With the above health check, Marathon can be configured to automatically 
> restart the MesosClusterDispatcher service when the health check fails, 
> lessening the need for manual intervention.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
