----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/8680/#review15559 -----------------------------------------------------------
Biggest piece of feedback here is really deciding what is a recovery error and what can just be printed and ignored. We talked a bit about this offline, but I don't like a semantics where errors are just printed out in logs and require operators to go sleuthing. src/slave/flags.hpp <https://reviews.apache.org/r/8680/#comment33589> !? s/Enable recovery/recover/ src/slave/flags.hpp <https://reviews.apache.org/r/8680/#comment33590> You don't need to use a hypen with reconnect, it's a word. ;) src/slave/flags.hpp <https://reviews.apache.org/r/8680/#comment33591> Ditto above. src/slave/flags.hpp <https://reviews.apache.org/r/8680/#comment33593> Do we rename this cleanup later? src/slave/flags.hpp <https://reviews.apache.org/r/8680/#comment33592> Do these semantics change later? src/slave/slave.hpp <https://reviews.apache.org/r/8680/#comment33595> s/its// src/slave/slave.hpp <https://reviews.apache.org/r/8680/#comment33594> s/re-connect/reconnect/ src/slave/slave.hpp <https://reviews.apache.org/r/8680/#comment33596> s/../.,/ src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33597> Let's not scare our users, use an if check and EXIT please. src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33598> Rather than require keeping this in sync with the message in flags.hpp, how about ask them to use '--help' to see the valid options? src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33602> Here's a thought: Instead of the boolean 'recovered', make 'recover' return a future. Then, move all of the "installing" and files attaching into a new function called '_initialize' (the continuation of 'initialize'). Now, in 'initialize' do: recover(flags.recover.get() == "reconnect") .then(defer(self(), &Self::_initialize)); The only crux here is in doReliableRegistration. One solution would be to save the future you get from this chain (called it 'recovery' or 'recovered' even) and use that just like you do (but first checking .isReady() and then doing .get()). I'm not sure how doReliableRegistration actually gets reinvoked as the code stands now, but I'll address that in a comment below. src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33600> Maybe move making sure value of recover is valid down here so it's closer with the conditional? src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33603> You're only storing the ID if we're checkpointing? src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33604> So when will we attempt to register in the future? Again, I think using something like futures is the key here. Basically we want to future "events" to occur in order for us to attempt to register. The first event is that we have a new master. And the second event is that we've completed recovery. src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33605> s/foreachvalue(/foreachvalue (/ src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33606> What do you think about killing the 'id' field and always reading from info.id()? src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33607> Do we need a Checkpointer? src/slave/slave.cpp <https://reviews.apache.org/r/8680/#comment33608> This is where recover could return a failed future ... probably don't want to continue if recovery fails without explicit action from an operator! src/slave/state.hpp <https://reviews.apache.org/r/8680/#comment33609> The style has been to put functions above fields per visibility scoping (public, private, etc). And I tend to put static members (functions and fields) above instance members too. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33610> If the result is none this means the file is empty right? Want to print that out? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33611> The error message is a little redundant don't you think? ;) src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33612> What about isNone? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33614> Looking forward to a templated read that just returns type T. ;) src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33613> Ditto comment above about printing out the file is empty. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33615> What constitutes making this a WARNING instead of an INFO? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33616> Ditto above comments. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33617> Again, what constitutes WARNING? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33619> Let's capitalize our acronyms in comments: s/uuid/UUID/ src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33621> IMHO, using an else branch instead of a continue here links these two paragraphs together better. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33620> Seems like you might want to propagate this error? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33622> What does an error here mean? Why not return it? And what about the none case of pid? If we've sufficiently eliminated the none case a CHECK should be put in place until we replace Result with Try in stout/os.hpp. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33624> Again, what does an error mean here? Why not propagate? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33623> Ditto WARNING comments above. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33625> Why not propagate? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33626> Another ditto from earlier comments about saying file is empty (and maybe even better, what that possibly means?). src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33627> s/udpates/updates/ Why not propagate? src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33633> When protobuf::read fails it should set the seek position back to the position before it attempted the read, so you should be able to just truncate at the "current seek position" below instead of adding up size each time. Just wanted to propose the option. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33628> If you're grabbing the sterror anyway, might as well use PCHECK. src/slave/state.cpp <https://reviews.apache.org/r/8680/#comment33629> It might be helpful during debugging to demarcate the message using single quotes or newlines. Same goes for the CHECK_SOME above with protobufs, and newlines probably make more sense there. src/tests/paths_tests.cpp <https://reviews.apache.org/r/8680/#comment33630> s/PathsTest:/PathsTest :/ src/tests/paths_tests.cpp <https://reviews.apache.org/r/8680/#comment33631> Why protected? src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33632> s/SlaveStateTest:/SlaveStateTest :/ src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33635> Ditto protected comment. src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33634> s/SlaveRecoveryTest:/SlaveRecoverTest :/ src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33636> Again, why protected instead of public? src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33638> I just want to confirm that you want one work directory for all the SlaveRecoveryTest tests? src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33637> Move to tests/utils.hpp? Maybe add a few more default parameters? src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33640> Put Trigger on own line. src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33642> CHECK or ASSERT? Kill the process or the test? src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33639> Why was the cast needed? Do you just need to use 1U or 1L? src/tests/slave_recovery_tests.cpp <https://reviews.apache.org/r/8680/#comment33641> While it makes the code longer, I think a lot of these might be easier read if you broke each '.' on a newline: ASSERT_TRUE(state .frameworks[frameworkId] .executors[executorId] ...); Just a thought. - Benjamin Hindman On Jan. 4, 2013, 2:25 a.m., Vinod Kone wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/8680/ > ----------------------------------------------------------- > > (Updated Jan. 4, 2013, 2:25 a.m.) > > > Review request for mesos, Benjamin Hindman and Ben Mahler. > > > Description > ------- > > This only covers recovering slave state. > > Recovering status update manager and isolation modules will be coming in next > reviews. > > > Diffs > ----- > > src/Makefile.am c94736df660a25b58dc47c07d9c56c3c26152a66 > src/common/type_utils.hpp fde69aeec403b3455839dca6b0b2e1507d81ba00 > src/slave/constants.hpp ddf02570caf3793106b3c48e158a5bb48c1ae80c > src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd > src/slave/isolation_module.hpp b962365ebeddd047896a66b02a327aa26ae323d3 > src/slave/lxc_isolation_module.hpp 2bc844f491befbe588965da2ada7cfcef0b6f0a4 > src/slave/paths.hpp fbf3fd84fb8f2590311b18d2afec2d2e0d30ef0a > src/slave/process_based_isolation_module.hpp > f1817192582e3646f8dcf17934ba7998829e8fd6 > src/slave/process_based_isolation_module.cpp > 3d50a4b652e4e09dd57e744e408c8fb79ff3fbf5 > src/slave/slave.hpp e9f7b659ca2860501840b3d01e69915ebd162039 > src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 > src/slave/state.hpp c1e4c782380b0076313d2216c40e86774050d445 > src/slave/state.cpp 4d4c2470c44fe630ec0694ee937131b4f0aafc4e > src/slave/status_update_manager.hpp PRE-CREATION > src/tests/exception_tests.cpp 13355d08788432ed07679daf24c2d74cc12a7f11 > src/tests/paths_tests.cpp PRE-CREATION > src/tests/slave_recovery_tests.cpp PRE-CREATION > src/tests/slave_state_tests.cpp 1136ea93cffb1483458edad2605b0b4f83b61c44 > src/tests/status_update_manager_tests.cpp PRE-CREATION > > Diff: https://reviews.apache.org/r/8680/diff/ > > > Testing > ------- > > make check > > > Thanks, > > Vinod Kone > >
