> On Dec. 7, 2016, 9:53 p.m., Joseph Wu wrote:
> > src/master/master.cpp, lines 2841-2843
> > <https://reviews.apache.org/r/54495/diff/1/?file=1579042#file1579042line2841>
> >
> >     Do you want to force a relink too?
> >
> >     i.e. give this as the second argument:
> >     `process::RemoteConnection::RECONNECT`

Per discussion on Slack with Joseph, it seems we don't need to force a reconnect here, because the master will promptly send a (re-)registered message to the framework. If the socket is half-open, the send on that socket should eventually fail with an error. That failure will produce another `exited` event, at which point we'll correctly mark the framework as disconnected again and send it another `FrameworkErrorMessage`.


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54495/#review158408
-----------------------------------------------------------


On Dec. 7, 2016, 8:04 p.m., Neil Conway wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54495/
> -----------------------------------------------------------
>
> (Updated Dec. 7, 2016, 8:04 p.m.)
>
>
> Review request for mesos and Vinod Kone.
>
>
> Bugs: MESOS-6676
>     https://issues.apache.org/jira/browse/MESOS-6676
>
>
> Repository: mesos
>
>
> Description
> -------
>
> In the following scenario:
> * Master sees a re-registration attempt from a PID-based scheduler,
> * The scheduler was previously registered with the master,
> * and the "force" flag is not set
>
> The master neglected to re-link with the scheduler. For example, this
> might happen if:
>
> * The master sees an ExitedEvent for the framework and marks it
>   disconnected.
> * The master sends a FrameworkErrorMessage to the framework but this
>   message is dropped, e.g., due to a transient network failure.
> * The scheduler attempts to re-register with the master, e.g., because
>   it detects (spuriously) that the current leading master has changed.
>
> This is problematic, because it might leave the master -> scheduler
> connection using an ephemeral socket.
>
>
> Diffs
> -----
>
>   src/master/master.cpp 67f32229470da4cf7953881d1c5dcb99393002de
>
> Diff: https://reviews.apache.org/r/54495/diff/
>
>
> Testing
> -------
>
> `make check`
>
> Note that it would be _great_ to write a unit test for this situation (as
> well as a class of related failure conditions), but the current testing
> infrastructure doesn't make that easy.
>
>
> Thanks,
>
> Neil Conway
>
>
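
For readers following the thread, the sequence discussed above can be sketched as a toy state machine. This is only an illustration under stated assumptions: `ToyMaster`, `Framework`, `Socket`, `peerReachable`, and `relinkOnReregister` are hypothetical names invented for this sketch, not Mesos or libprocess APIs, and the real master is asynchronous and considerably more involved. The sketch models three things: a re-registration without a re-link falls back to an ephemeral socket, a re-link restores the persistent one, and a send on a half-open socket fails and triggers another `exited`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of the master -> scheduler connection state.
enum class Socket { NONE, EPHEMERAL, PERSISTENT };

struct Framework {
  bool connected = false;
  Socket socket = Socket::NONE;
  bool peerReachable = true;  // false models a half-open socket
};

struct ToyMaster {
  Framework framework;
  std::vector<std::string> log;
  bool relinkOnReregister;  // true models the behavior proposed in the patch

  explicit ToyMaster(bool relink) : relinkOnReregister(relink) {}

  // Establish (or keep) a persistent link to the scheduler.
  void link() { framework.socket = Socket::PERSISTENT; }

  // Send a message; without a link, an ephemeral socket is used.
  // A send to an unreachable peer fails and surfaces as an exit.
  bool send(const std::string& message) {
    if (framework.socket == Socket::NONE) {
      framework.socket = Socket::EPHEMERAL;
    }
    if (!framework.peerReachable) {
      exited();
      return false;
    }
    log.push_back(message);
    return true;
  }

  // The link broke: mark the framework disconnected. The error message
  // is only recorded here, since in the scenario it may be dropped.
  void exited() {
    framework.connected = false;
    framework.socket = Socket::NONE;
    log.push_back("exited; FrameworkErrorMessage attempted");
  }

  // Re-registration of a previously registered scheduler, "force" unset.
  bool reregister() {
    framework.connected = true;
    if (relinkOnReregister) {
      link();
    }
    return send("FrameworkReregisteredMessage");
  }
};
```

Constructing `ToyMaster(false)`, then calling `link()`, `exited()`, and `reregister()` leaves `framework.socket` at `Socket::EPHEMERAL`, which is the problem described in the review request; with `ToyMaster(true)` it ends at `Socket::PERSISTENT`. Setting `peerReachable = false` before `reregister()` makes the send fail and marks the framework disconnected again, which is why a forced reconnect seemed unnecessary.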
