[ https://issues.apache.org/jira/browse/MESOS-8630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429105#comment-16429105 ]
Yan Xu commented on MESOS-8630:
-------------------------------

A first step could be to identify all the places that update the registry and {{LOG(FATAL)}}; we can also see if we can abstract it out.

> All subsequent registry operations fail after the registrar is aborted after a failed update
> --------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8630
>                 URL: https://issues.apache.org/jira/browse/MESOS-8630
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Yan Xu
>            Priority: Major
>
> Failure to update the registry always aborts the registrar but doesn't always abort
> the master process.
> When the registrar fails to update the registry, it aborts the actor and
> fails all future operations. The rationale is explained here:
> [https://github.com/apache/mesos/commit/5eaf1eb346fc2f46c852c1246bdff12a89216b60]
> {quote}In this event, the Master won't commit suicide until the initial
> failure is processed. However, in the interim, subsequent operations
> are potentially being performed against the Registrar. This could lead
> to fighting between masters if a "demoted" master re-attempts to
> acquire log-leadership!
> {quote}
> However, when the registry update is requested through an operator API
> (maintenance, quota update, etc.), the master process doesn't shut down (a 500
> error is returned to the client instead) and all subsequent operations
> fail!
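
To make the failure mode in the description concrete, here is a minimal sketch of the latching behavior. It is not the Mesos code: {{ToyRegistrar}} and {{handleOperatorRequest}} are hypothetical names invented for illustration, and the real registrar is an asynchronous libprocess actor rather than a synchronous class. The sketch only assumes what the description states: the first failed update aborts the registrar, every later operation fails, and the operator API path answers with a 500 while the process keeps running.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the registrar (assumption, not the Mesos API).
class ToyRegistrar {
public:
  // Runs `write` against the backing store and returns true on success.
  // After the first failed write the registrar latches into an aborted
  // state and rejects every later operation without attempting it.
  bool apply(const std::function<bool()>& write) {
    if (aborted_) {
      return false;  // Already aborted: the write is never attempted.
    }
    if (!write()) {
      aborted_ = true;  // First failure aborts the registrar for good.
      return false;
    }
    return true;
  }

  bool aborted() const { return aborted_; }

private:
  bool aborted_ = false;
};

// Hypothetical operator API handler (assumption): a registrar failure is
// mapped to an HTTP 500, but nothing here terminates the process, so the
// aborted registrar stays in place for all later requests.
int handleOperatorRequest(ToyRegistrar& registrar,
                          const std::function<bool()>& write) {
  return registrar.apply(write) ? 200 : 500;
}
```

In this sketch a single failed write makes every later call to {{handleOperatorRequest}} return 500, even when the write itself would have succeeded. Code paths that {{LOG(FATAL)}} on a failed update avoid this state by terminating the process instead.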