> On April 14, 2014, 5:04 p.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, lines 3116-3118
> > <https://reviews.apache.org/r/20221/diff/2/?file=554602#file554602line3116>
> >
> >     I think what this is saying is:
> >
> >     If we have a valid run (determined in the code above) then we're sure
> >     to have a checkpointed ExecutorInfo because the ExecutorInfo is
> >     checkpointed before we checkpoint any information about a run.
> >
> >     But is it possible that a run is valid but for whatever reason
> >     recovering the ExecutorInfo fails? For example, because the file got
> >     corrupted or was accidentally deleted?
>
> Niklas Nielsen wrote:
>     If the executor info file gets corrupted or deleted, the check would fail.
>
>     How about extending the test on entry (which ensures the presence of runs,
>     and otherwise gracefully GCs and aborts recovery) with
>     ... || state.info.isNone()?
>     The test will be removed in the task info patch anyway, as we deal with
>     the missing executor info explicitly there.

I like the second suggestion because hard failing on executor info corruption is bad.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40269
-----------------------------------------------------------


On April 10, 2014, 8:26 p.m., Niklas Nielsen wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
>
> (Updated April 10, 2014, 8:26 p.m.)
>
>
> Review request for mesos, Ian Downes and Vinod Kone.
>
>
> Repository: mesos-git
>
>
> Description
> -------
>
> This patch lets executor recovery recover runs in the absence of
> executor info. This is needed because the new task-info patch will
> introduce an intermediate state where the executor info hasn't been
> checkpointed yet. In this interim, the slave may fail over and should
> be in a position to clean up orphan containers (for now, the
> containerizer API doesn't provide a way to reconcile the executor info,
> so it is not possible to recover the containers in this case).
>
>
> Diffs
> -----
>
>   src/slave/slave.cpp cddb241
>   src/slave/state.cpp 21d1fb7
>
> Diff: https://reviews.apache.org/r/20221/diff/
>
>
> Testing
> -------
>
> make check and tested with task-info patch and new launch test.
>
>
> Thanks,
>
> Niklas Nielsen
>
>
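
For illustration, the extended entry check discussed in the thread above (skip recovery gracefully when either the runs or the executor info are missing, rather than hard failing) could be sketched roughly as follows. This is a minimal standalone sketch, not the code in the patch: `ExecutorState`, its fields, and `shouldSkipRecovery` are hypothetical names, and `std::optional` stands in for the `Option` type implied by `state.info.isNone()`.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the recovered executor state. The field names
// are assumptions for illustration, not the actual Mesos structs.
struct ExecutorState {
  std::optional<std::string> info;    // Checkpointed ExecutorInfo, if readable.
  std::optional<std::string> latest;  // ID of the latest run, if one exists.
};

// Returns true when recovery of this executor should be skipped gracefully
// (e.g. schedule GC and return) instead of failing hard on a check.
// The second condition mirrors the proposed `|| state.info.isNone()`.
bool shouldSkipRecovery(const ExecutorState& state) {
  return !state.latest.has_value() || !state.info.has_value();
}
```

With a guard of this shape, a corrupted or deleted executor info file takes the same graceful path as a missing run, so the later assumption that the info is present only applies to executors that pass the check.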
