> On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > Biggest piece of feedback here is really deciding what is a recovery error > > and what can just be printed and ignored. We talked a bit about this > > offline, but I don't like a semantics where errors are just printed out in > > logs and require operators to go sleuthing.
As discussed offline, I've decided to short-circuit recovery on ANY and ALL types of recovery errors. While this makes the code safe and easy to reason about, I'm still skeptical that this is going to be practical. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/flags.hpp, line 116 > > <https://reviews.apache.org/r/8680/diff/4/?file=244659#file244659line116> > > > > Do we rename this cleanup later? yes! it should be in the 'Recover executors' review. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/flags.hpp, lines 120-121 > > <https://reviews.apache.org/r/8680/diff/4/?file=244659#file244659line120> > > > > Do these semantics change later? yes. as you will see in the 'Recover executors' review, this flag is no longer optional. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 433 > > <https://reviews.apache.org/r/8680/diff/4/?file=244666#file244666line433> > > > > You're only storing the ID if we're checkpointing? i'm checkpointing info. i had to update 'id' inside info because this is the place a slave is assigned an 'id'. the other fields of 'info' are set in initialize(). > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, lines 483-486 > > <https://reviews.apache.org/r/8680/diff/4/?file=244666#file244666line483> > > > > So when will we attempt to register in the future? Again, I think using > > something like futures is the key here. Basically we want to future > > "events" to occur in order for us to attempt to register. The first event > > is that we have a new master. And the second event is that we've completed > > recovery. as discussed offline, this was a bug in the logic that surprisingly didn't get caught in tests! > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 569 > > <https://reviews.apache.org/r/8680/diff/4/?file=244666#file244666line569> > > > > What do you think about killing the 'id' field and always reading from > > info.id()? i think its great. one source of truth. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, lines 974-975 > > <https://reviews.apache.org/r/8680/diff/4/?file=244666#file244666line974> > > > > Do we need a Checkpointer? Yes. But this gets a bit complicated when using 'async' because async doesn't support void functions, whereas the callbacks on future only expect void functions. I will leave it as is for now, and revisit this in future as a separate diff (Part 12!) > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 41 > > <https://reviews.apache.org/r/8680/diff/4/?file=244668#file244668line41> > > > > If the result is none this means the file is empty right? Want to print > > that out? i will wait for benm's protobuf read/write reviews to be committed before making any changes. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 98 > > <https://reviews.apache.org/r/8680/diff/4/?file=244668#file244668line98> > > > > Looking forward to a templated read that just returns type T. ;) can't wait for benm's code to be committed! > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 102 > > <https://reviews.apache.org/r/8680/diff/4/?file=244668#file244668line102> > > > > Ditto comment above about printing out the file is empty. ditto > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 232 > > <https://reviews.apache.org/r/8680/diff/4/?file=244668#file244668line232> > > > > What does an error here mean? Why not return it? And what about the > > none case of pid? If we've sufficiently eliminated the none case a CHECK > > should be put in place until we replace Result with Try in stout/os.hpp. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 338 > > <https://reviews.apache.org/r/8680/diff/4/?file=244668#file244668line338> > > > > When protobuf::read fails it should set the seek position back to the > > position before it attempted the read, so you should be able to just > > truncate at the "current seek position" below instead of adding up size > > each time. Just wanted to propose the option. good point. fixed > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/tests/slave_recovery_tests.cpp, line 125 > > <https://reviews.apache.org/r/8680/diff/4/?file=244672#file244672line125> > > > > Again, why protected instead of public? this is how it was done in google test documentation for setup/teardown. https://code.google.com/p/googletest/wiki/AdvancedGuide. any reason it should be public? > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/tests/slave_recovery_tests.cpp, line 128 > > <https://reviews.apache.org/r/8680/diff/4/?file=244672#file244672line128> > > > > I just want to confirm that you want one work directory for all the > > SlaveRecoveryTest tests? it gets cleaned up after every test. but you are right, its cleaner to create a different directory per test. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/tests/slave_recovery_tests.cpp, line 371 > > <https://reviews.apache.org/r/8680/diff/4/?file=244672#file244672line371> > > > > While it makes the code longer, I think a lot of these might be easier > > read if you broke each '.' on a newline: > > > > ASSERT_TRUE(state > > .frameworks[frameworkId] > > .executors[executorId] > > ...); > > > > Just a thought. agreed. fixed all those that didn't fit on one line > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 334 > > <https://reviews.apache.org/r/8680/diff/4/?file=244666#file244666line334> > > > > Here's a thought: > > > > Instead of the boolean 'recovered', make 'recover' return a future. > > Then, move all of the "installing" and files attaching into a new function > > called '_initialize' (the continuation of 'initialize'). Now, in > > 'initialize' do: > > > > recover(flags.recover.get() == "reconnect") > > .then(defer(self(), &Self::_initialize)); > > > > The only crux here is in doReliableRegistration. One solution would be > > to save the future you get from this chain (called it 'recovery' or > > 'recovered' even) and use that just like you do (but first checking > > .isReady() and then doing .get()). I'm not sure how doReliableRegistration > > actually gets reinvoked as the code stands now, but I'll address that in a > > comment below. Its not clear to me, atleast from the point of this review, why we would want recovery to be asynchronous. i will revisit this when i work on the downstream reviews where the complete recovery is implemented. Added a TODO for now. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/slave.cpp, line 1438 > > <https://reviews.apache.org/r/8680/diff/4/?file=244666#file244666line1438> > > > > This is where recover could return a failed future ... probably don't > > want to continue if recovery fails without explicit action from an operator! made this LOG(FATAL) for now. > On Jan. 22, 2013, 4:22 a.m., Benjamin Hindman wrote: > > src/slave/state.cpp, line 332 > > <https://reviews.apache.org/r/8680/diff/4/?file=244668#file244668line332> > > > > s/udpates/updates/ > > > > Why not propagate? - Vinod ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/8680/#review15559 ----------------------------------------------------------- On Jan. 4, 2013, 2:25 a.m., Vinod Kone wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/8680/ > ----------------------------------------------------------- > > (Updated Jan. 4, 2013, 2:25 a.m.) > > > Review request for mesos, Benjamin Hindman and Ben Mahler. > > > Description > ------- > > This only covers recovering slave state. > > Recovering status update manager and isolation modules will be coming in next > reviews. > > > Diffs > ----- > > src/Makefile.am c94736df660a25b58dc47c07d9c56c3c26152a66 > src/common/type_utils.hpp fde69aeec403b3455839dca6b0b2e1507d81ba00 > src/slave/constants.hpp ddf02570caf3793106b3c48e158a5bb48c1ae80c > src/slave/flags.hpp 39e57f4104ee7a1538436ebbb9493581e28c99dd > src/slave/isolation_module.hpp b962365ebeddd047896a66b02a327aa26ae323d3 > src/slave/lxc_isolation_module.hpp 2bc844f491befbe588965da2ada7cfcef0b6f0a4 > src/slave/paths.hpp fbf3fd84fb8f2590311b18d2afec2d2e0d30ef0a > src/slave/process_based_isolation_module.hpp > f1817192582e3646f8dcf17934ba7998829e8fd6 > src/slave/process_based_isolation_module.cpp > 3d50a4b652e4e09dd57e744e408c8fb79ff3fbf5 > src/slave/slave.hpp e9f7b659ca2860501840b3d01e69915ebd162039 > src/slave/slave.cpp 9755b46f97173d6fcc9ab1fd63e0e4814b3bc018 > src/slave/state.hpp c1e4c782380b0076313d2216c40e86774050d445 > src/slave/state.cpp 4d4c2470c44fe630ec0694ee937131b4f0aafc4e > src/slave/status_update_manager.hpp PRE-CREATION > src/tests/exception_tests.cpp 13355d08788432ed07679daf24c2d74cc12a7f11 > src/tests/paths_tests.cpp PRE-CREATION > src/tests/slave_recovery_tests.cpp PRE-CREATION > src/tests/slave_state_tests.cpp 1136ea93cffb1483458edad2605b0b4f83b61c44 > src/tests/status_update_manager_tests.cpp PRE-CREATION > > Diff: https://reviews.apache.org/r/8680/diff/ > > > Testing > ------- > > make check > > > Thanks, > > Vinod Kone > >
