GitHub user skonto opened a pull request:

    https://github.com/apache/spark/pull/18705

    [SPARK-21502][Mesos] fix --supervise for mesos in cluster mode

    ## What changes were proposed in this pull request?
    With supervise enabled for a driver, re-launching it has so far failed
because the re-launched driver reused the same framework id. This patch creates
a new driver framework id every time we re-launch a driver, but keeps the driver
submission id the same, since it matches the task id the driver was launched
with on Mesos, and the retry state and other info within the Dispatcher's data
structures use it as a key.
    We append a "-retry-%4d" string as a suffix to the framework id passed by
the dispatcher to the driver, and the same value to the app_id created by each
driver, except for the first launch, where we don't need the retry suffix.
    The previous format for the frameworkId was
'DispatcherFId-DriverSubmissionId'.
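
    As a rough illustration of the naming scheme described above, here is a
minimal Python sketch. It is not the code in this patch: the names
`DriverSubmission`, `retry_suffix` and `driver_framework_id` are hypothetical,
and the retry counter is printed unpadded, which is an assumption, since the
exact width and padding implied by "%4d" is not shown here.

```python
from dataclasses import dataclass


@dataclass
class DriverSubmission:
    """Hypothetical stand-in for the dispatcher's per-driver record."""
    submission_id: str   # stays the same across re-launches (also the Mesos task id)
    retries: int = 0     # 0 means the driver is being launched for the first time


def retry_suffix(retries: int) -> str:
    # No suffix on the first launch; "-retry-<n>" on every re-launch.
    # Assumption: the counter is rendered as a plain, unpadded decimal.
    return "" if retries == 0 else f"-retry-{retries}"


def driver_framework_id(dispatcher_fid: str, sub: DriverSubmission) -> str:
    # Previous format was "<DispatcherFId>-<DriverSubmissionId>";
    # the retry suffix is appended so each re-launch gets a fresh framework id.
    return f"{dispatcher_fid}-{sub.submission_id}{retry_suffix(sub.retries)}"


# Retry state stays keyed by the unchanged submission id.
retry_state = {}
sub = DriverSubmission(submission_id="driver-0001")
first_id = driver_framework_id("dispatcher-fid", sub)

sub.retries += 1
retry_state[sub.submission_id] = sub.retries
second_id = driver_framework_id("dispatcher-fid", sub)
```

    In this sketch `first_id` has the old format with no suffix, while
`second_id` differs from it only by the retry suffix, so the framework id
changes on every re-launch while the dictionary key does not.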
    
    We also detect the case where multiple Spark contexts are started from
within the same driver, and we set proper names for their corresponding
app ids. The old practice was to unset the framework id passed from the
dispatcher after the driver framework was started for the first time and let
Mesos decide the framework id for subsequent Spark contexts. That framework id
was then passed as the app id.
    This patch heavily affects the history server. We don't have the issue
of the standalone case, where the driver id must be different, since the
dispatcher will re-launch a driver (Mesos task) only if it gets an update that
it is dead, and this is verified by Mesos implicitly. We also don't fix the
fine-grained mode, which is deprecated and of no use.
    
    ## How was this patch tested?
    
    This patch was manually tested on DC/OS: we launched a driver, stopped its
container, and verified the expected behavior.
    
    Initial retry of the driver, driver in pending state:
    
    
![image](https://user-images.githubusercontent.com/7945591/28473862-1088b736-6e4f-11e7-8d7d-7b785b1da6a6.png)
    
    Driver re-launched:
    
![image](https://user-images.githubusercontent.com/7945591/28473885-26e02d16-6e4f-11e7-9eb8-6bf7bdb10cb8.png)
    
    Another retry:
    
![image](https://user-images.githubusercontent.com/7945591/28473897-35702318-6e4f-11e7-9585-fd295ad7c6b6.png)
    
    The resulting entries in the history server (at the bottom):
    
    
![image](https://user-images.githubusercontent.com/7945591/28473910-4946dabc-6e4f-11e7-90a6-fa4f80893c61.png)
    
    For multiple Spark contexts, here is the end result in the Spark history
server:
    
    
![image](https://user-images.githubusercontent.com/7945591/28474432-69cf8b06-6e51-11e7-93c7-e6c0b04dec93.png)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/skonto/spark fix_supervise_flag

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18705.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18705
    
----
commit b987c4b28c3aa96f39e78dcc74da570226c6bdba
Author: Stavros Kontopoulos <[email protected]>
Date:   2017-07-21T00:18:34Z

    fix supervise for mesos in cluster mode

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

