> On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.hpp, line 1308 > > <https://reviews.apache.org/r/25525/diff/8/?file=717604#file717604line1308> > > > > This should be CHECK_SOME(compatible).
True, a incompatible change shouldn't get that far because there are advance checks, but practically the registrar will do the right thing / recover in the case of a compatibility error here. CHECK_SOME() means that we will always fatally exit and kill the master, which seems worse to me. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3100-3102 > > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3100> > > > > ``` > > LOG(WARNING) << "Slave " << *slave << " attempted to re-register with > > incompatible info: " > > << compatible.error() << "; shutting it down" > > ``` > > > > Also, maybe we don't need to explicitly 'removeSlave(slave)' here? If a > > compatible slave is not re-registered within timeout, master will > > automatically remove it. Updated the message. I would rather explicitly tell the slave to shutdown than wait, otherwise the slave will continue trying to reregister with the master, the master will keep saying no, until the timeout at which point we will tell the slave to shutdown when it tries to re-authenticate with the message "Slave attempted to re-register after removal". Waiting gives the operator worse information, and makes it a longer loop for getting the slave back into the cluster, than giving the notification, and it really isn't much (or very complex) code. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3146-3149 > > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3146> > > > > Hmm. i don't think we want these semantics. Why not just fail if > > readmission failed? So if the registrar actually fails to readmit the slave, then the master will go down on a LOG(FATAL). In the case where we couldn't readmit the slave for some reason (Ex: the registrar thought the upgrade was incompatible, hit the fall-through case in master/master.hpp), then the slave will be sent a shutdown message and explicitly removed at that point. So I think the semantics you want are there / the comment can just be removed. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 3169-3171 > > <https://reviews.apache.org/r/25525/diff/8/?file=717605#file717605line3169> > > > > This seems very unintuitve for users/operators. Lets fix this in this > > patch rather than doing a TODO. > > > > > > Also, AFAICT, the changes are not lost because both the registrar and > > the allocator are informed about the new slave info. Am I missing something? Note this isn't a fall through from the above code. All the above code exits. At this point, we haven't talked to the registrar or allocator. All that has happened since reciept of the message is authentication, checking that the slave hasn't already been removed, and checking that the master doesn't already know about the slave (getSlave reutrns null). So this can be triggered by a slave reregistering, and that first registration being sent to the Registrar. While that registration is happening, the same slave sends another reregistration message, this time with updated SlaveInfo. The re-registration check will be hit, and the updated SlaveInfo will be discarded. per the check. The solutions I can see: 1) Send a message back to the slave that we already got a reregistration with the last registration's SlaveInfo, and if it doesn't want that one it needs to update it's registration after that one has completed 2) Make it so we can queue multiple in-flight reregistrations for a single slave (No idea if this will break anything / would work if you just removed the check). I worry a little with this option that there could be a lot of reregistrations in flight at the same time to the master and it could get overwhelmed doing extra work where before there was an early exit. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/common/slaveinfo_utils.cpp, lines 64-96 > > <https://reviews.apache.org/r/25525/diff/8/?file=717601#file717601line64> > > > > Hmm. This is still hard to follow. > > > > Using variadic templated function here doesn't make it any simpler to > > read/follow. Infact it makes it harder because now I need to read > > understand what the template function does. I think explicitly checking > > each field (as suggested in the earlier comment) reads better. So I can do that, which gives: ```c++ if (!(newSlaveInfo.id() == oldSlaveInfo.id())) { return Error("id cannot be changed (old: " + stringify(oldSlaveInfo.id()) + " new: " + stringify(newSlaveInfo.id()) + ")"); } if (newSlaveInfo.hostname() != oldSlaveInfo.hostname()) { return Error("hostname cannot be changed (old: " + stringify(oldSlaveInfo.hostname()) + " new: " + stringify(newSlaveInfo.hostname()) + ")"); } if (newSlaveInfo.checkpoint() != oldSlaveInfo.checkpoint()) { return Error("checkpoint cannot be changed (old: " + stringify(oldSlaveInfo.checkpoint()) + " new: " + stringify(newSlaveInfo.checkpoint()) + ")"); } if (newSlaveInfo.port() != oldSlaveInfo.port()) { return Error("port cannot be changed (old: " + stringify(oldSlaveInfo.port()) + " new: " + stringify(newSlaveInfo.port()) + ")"); } ``` My problems with it are: 1. There is a lot of redundant code which needs to be mentally parsed. I have to realize each of those is the same reading 3 lines of code before I can recognize "oh, we're checking over and over again these two fields in the different versions are the same, and mentally condense the code away" 2. How likely is it that you would actually notice all the new vs. old if I got one of the error messages wrong? What about if I made any of a large number of possible slight typos in one vs. the next? What if I compare two different string fields? The compiler can't catch any of these or help me with them. ```c++ checkMembersUnchanged(newSlaveInfo, oldSlaveInfo, make_pair("SlaveID", &SlaveInfo::id), make_pair("hostname", &SlaveInfo::hostname), make_pair("checkpoint", &SlaveInfo::checkpoint), make_pair("port", &SlaveInfo::port)); ``` Where when reading just that code, I see I'm giong to check that the members are unchanged, probably between newSlaveInfo and oldSlaveInfo. That the fields are "SlaveID", "hostname", "checkpoint", and "port". The checkMembersUnchanged takes a little bit more time/effort to recognize, esp. since it is c++11 variadic templates, recursion, and pointers to member functions + calling pointers to member functions (Which is some of the oddest syntax in C++). What is nice about it though is: 1. The function though guarantees I call the same function on both old and new / access the same member 2. Formats the error string consistently, gives me one place to update / change it if needed 3. Makes one place for the ordering to be correct. > On Oct. 15, 2014, 1:25 a.m., Vinod Kone wrote: > > src/master/hierarchical_allocator_process.hpp, line 490 > > <https://reviews.apache.org/r/25525/diff/8/?file=717603#file717603line490> > > > > Instead of adding a 'update()' method on Slave struct, just manipulate > > the resources here. > > > > Also, you want to CHECK that the new resources are more than the old > > resources. I can't just manipulate the resources because the 'info' member is private, so I need a member function to update that (or make it public). Added a CHECK line. Practically the same available delta calculation works on the removal of resources. There is a slight possible hole when a resources is removed (not in new but is in old, the subtraction won't result in a "need to remove" entry, so available will not get the change) that I need to check for in later coding stages / when we get there. - Cody ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25525/#review56590 ----------------------------------------------------------- On Oct. 15, 2014, 3:13 a.m., Cody Maloney wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/25525/ > ----------------------------------------------------------- > > (Updated Oct. 15, 2014, 3:13 a.m.) > > > Review request for mesos, Adam B and Vinod Kone. > > > Bugs: MESOS-1739 > https://issues.apache.org/jira/browse/MESOS-1739 > > > Repository: mesos-git > > > Description > ------- > > Allows attributes and resources to be set to a superset of what they were > previously on a slave restart. > > Incorporates all comments from: > https://issues.apache.org/jira/browse/MESOS-1739 > and the former review request: > https://reviews.apache.org/r/25111/ > > > Diffs > ----- > > src/Makefile.am d503c8df73cda15a9d59254e8265e4a5d0e003a4 > src/common/attributes.hpp 0a043d5b5dca804c6dd215cabd2704f24df71a33 > src/common/attributes.cpp aab114e1a5932e3f218b850e1afc7f2ef0f10e21 > src/common/slaveinfo_utils.hpp PRE-CREATION > src/common/slaveinfo_utils.cpp PRE-CREATION > src/master/allocator.hpp 02d20d0cc802805bc702891306aa42894531b223 > src/master/hierarchical_allocator_process.hpp > 31dfb2cc86e2e74da43f05fbada10976ce65f3e4 > src/master/master.hpp 14f1d0fd240c9cd0718d0238e1fbb9c733190205 > src/master/master.cpp 1b1ce0de3065ced45890777d42c69ef1091f5699 > src/slave/slave.cpp 0e342ed35e3db3b68f9f32b6cf4ace23e4a4db38 > src/tests/attributes_tests.cpp 240a8cac18ac12178cf73e8eeb88bd50e3fcc03b > src/tests/mesos.hpp 957e2233cc11c438fd80d3b6d1907a1223093104 > src/tests/slave_tests.cpp f585bdd20ae1af466f2c1b4d85331ac67451552f > > Diff: https://reviews.apache.org/r/25525/diff/ > > > Testing > ------- > > make check on localhost > > > Thanks, > > Cody Maloney > >
