Re: Review Request 14616: Refactored slave recovery to make reading checkpointed data asynchronous.

Ben Mahler Thu, 17 Oct 2013 11:31:37 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14616/#review27136
-----------------------------------------------------------




src/slave/slave.hpp
<https://reviews.apache.org/r/14616/#comment52789>

    I think we can get away without storing the state in the slave, see my 
comments below.



src/slave/slave.cpp
<https://reviews.apache.org/r/14616/#comment52796>

    In the old code, because StatusUpdateManager did not take an optional 
SlaveState, in one case we only recovered the Isolator and in another case we 
recovered both the SUM and the Isolator.
    
    Now that it takes an option, we can always recover both the SUM and the 
Isolator, and if reconnect takes an Option then we can always just call all 
three, leaving only a single return statement! :)
    
    So recoverState() can be _recover(), and reconnect can be __recover():
    
    We can have a single return inside _recover. There is a bit of clunkyness 
going from Result -> Option but we can make that easier in stout :)
    
    recover() {
      async (state::recover)
        .then(defer(self(), &Self::_recover, lambda::_1));
    }
    
    _recover(Result<SlaveState> state) {
      // Check for state being an error.
      if (state is error) { EXIT(1) << "..." }
      
      if (state is non empty) {
        // Check for SlaveInfo compatibility.
        // Check for recoveryErrors.
        // Recover frameworks and executors.
      }
    
      // The only return statement in _recover!
      // Now recover the SUM, the Isolator, and then complete recovery.
      return statusUpdateManager->recover(metaDir, state)
        .then(defer(isolator, &Isolator::recover, state)
        .then(defer(self(), &Self::__recover, state);
    }
    
    __recover(Option<SlaveState> state) {
      // Monitor the executors and reconnect if desired.
    }
    
    It looks like we don't necessarily need to pass reconnect around since it's 
specified in the flag? 
    
    Altogether, we could kill the new helpers and new recoveryState, and have a 
flow of recover->_recover->__recover which may be easier for follow and later 
convert to lambdas if it simplifies things further.


- Ben Mahler


On Oct. 12, 2013, 12:40 a.m., Vinod Kone wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14616/
> -----------------------------------------------------------
> 
> (Updated Oct. 12, 2013, 12:40 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman and Ben Mahler.
> 
> 
> Bugs: MESOS-732
>     https://issues.apache.org/jira/browse/MESOS-732
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Improvements:
> 
> --> Reading from disk is now done in a separate libprocess (process::async 
> ftw :))
> 
> --> Streamlined recovery code.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.hpp 22fb74b71a0f52d9d67b92ecc286fa8d350e41a4 
>   src/slave/slave.cpp debb2f4ce05fbfec450197e68bc8a0c78f1d0adf 
>   src/slave/state.cpp 5208e4e8eaaea5aea9e23a6ac7d09822e15433b2 
>   src/slave/status_update_manager.hpp 
> 1f55f868548ae79052856e0255097b447ffe7573 
>   src/slave/status_update_manager.cpp 
> b6afeb18d3bd97ca4f4238b2ffbf6bd7de8e6888 
>   src/tests/slave_recovery_tests.cpp 4d262c65043ad86e0851ac816092f6e8b468a114 
> 
> Diff: https://reviews.apache.org/r/14616/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Vinod Kone
> 
>

Re: Review Request 14616: Refactored slave recovery to make reading checkpointed data asynchronous.

Reply via email to