> On April 14, 2014, 5:04 p.m., Benjamin Hindman wrote:
> > src/slave/slave.cpp, lines 3116-3118
> > <https://reviews.apache.org/r/20221/diff/2/?file=554602#file554602line3116>
> >
> >     I think what this is saying is:
> >     
> >     If we have a valid run (determined in the code above) then we're sure
> > to have a checkpointed ExecutorInfo, because the ExecutorInfo is
> > checkpointed before we checkpoint any information about a run.
> >     
> >     But is it possible that a run is valid but, for whatever reason,
> > recovering the ExecutorInfo fails? For example, because the file got
> > corrupted or was accidentally deleted?
> 
> Niklas Nielsen wrote:
>     If the executor info file gets corrupted or deleted, the check would fail.
>     
>     How about extending the check on entry (the one that ensures runs are
> present, and otherwise gracefully GCs and aborts recovery) with
> ... || state.info.isNone()?
>     That check will be removed in the task-info patch anyway, since that
> patch handles the missing executor info explicitly.

I like the second suggestion, because hard-failing on a corrupted executor info
file is bad.
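A minimal C++ sketch of the extended entry check discussed above, assuming hypothetical stand-in types: ExecutorState, RunState and shouldRecover are illustrative names only, std::optional stands in for stout's Option<T> (whose isNone() the thread mentions), and the GC step is left as a comment. It is not the code from src/slave/slave.cpp; it only shows the shape of the guard, where a missing ExecutorInfo is treated like "no runs" (skip and GC) instead of tripping a hard CHECK.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a recovered run; the real slave state is richer.
struct RunState {};

// Hypothetical stand-in for a recovered executor's checkpointed state.
struct ExecutorState {
  std::string id;
  std::optional<std::string> info;       // Checkpointed ExecutorInfo, if any.
  std::map<std::string, RunState> runs;  // Recovered runs, keyed by run id.
};

// Returns true if recovery of this executor should proceed.
// If there are no runs, or the ExecutorInfo could not be recovered
// (never checkpointed, corrupted, or deleted), recovery of this executor
// is skipped rather than failing the whole slave.
bool shouldRecover(const ExecutorState& state) {
  if (state.runs.empty() || !state.info.has_value()) {
    // Real code would schedule the executor's work for GC here (omitted).
    return false;
  }
  return true;
}
```

With this shape, an executor whose info file is missing or unreadable is simply skipped, which matches the preference above for not hard-failing on a corrupted executor info.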


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/20221/#review40269
-----------------------------------------------------------


On April 10, 2014, 8:26 p.m., Niklas Nielsen wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/20221/
> -----------------------------------------------------------
> 
> (Updated April 10, 2014, 8:26 p.m.)
> 
> 
> Review request for mesos, Ian Downes and Vinod Kone.
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This patch lets executor recovery recover runs in the absence of
> executor info. This is needed because the new task-info patch will
> introduce an intermediate state where the executor info hasn't been
> checkpointed yet. In this interim, the slave may fail over and should
> be in a position to clean up orphan containers (for now, the
> containerizer API doesn't provide a way to reconcile the executor
> info, so it is not possible to recover the containers in this case).
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp cddb241 
>   src/slave/state.cpp 21d1fb7 
> 
> Diff: https://reviews.apache.org/r/20221/diff/
> 
> 
> Testing
> -------
> 
> make check, and tested with the task-info patch and the new launch test.
> 
> 
> Thanks,
> 
> Niklas Nielsen
> 
>