-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/#review117118
-----------------------------------------------------------

Fix it, then Ship it!


Thanks for taking this on, Neil! As we found out, this code is not the easiest to reason through.
I left some issues for places where we may be able to make the state assertions easier to read for the next set of readers.


src/tests/group_tests.cpp (lines 445 - 446)
<https://reviews.apache.org/r/42988/#comment178203>

    Can we add a short comment as to the state we're trying to achieve here? I think it will help readers of the test.


src/tests/group_tests.cpp (lines 451 - 452)
<https://reviews.apache.org/r/42988/#comment178204>

    Maybe a comment explaining that we're triggering the timeout? Or is this too self-explanatory?


src/zookeeper/group.cpp (lines 128 - 137)
<https://reviews.apache.org/r/42988/#comment178213>

    Not yours:
    Can we add a comment that we don't need to clean up the `delay` `Timer`s because they won't be invoked if libprocess can no longer get a `ProcessReference` to this Actor?


src/zookeeper/group.cpp (line 154)
<https://reviews.apache.org/r/42988/#comment178209>

    Should we s/promptly/within the sessionTimeout/ to be more clear?


src/zookeeper/group.cpp (lines 154 - 159)
<https://reviews.apache.org/r/42988/#comment178210>

    Some places we refer to `ZK` as in ZooKeeper.
    Other places we refer to the handle `zk` as in the variable.
    This introduces a third `Zk`. Can we keep the code consistent with just the 2 names above?
    We could say either the `ZK` handle or the `zk` handle?

    Here and elsewhere in your patch.


src/zookeeper/group.cpp (lines 365 - 366)
<https://reviews.apache.org/r/42988/#comment178214>

    Can we explain that a timer always exists during a fresh connection, and a reconnect?
    Maybe we can point to a top-level comment where you explain the DNS staleness problem.


src/zookeeper/group.cpp (lines 367 - 368)
<https://reviews.apache.org/r/42988/#comment178215>

    Comment along these lines:
    Once we are connected, we will be notified of a disconnect through the `reconnecting` callback, at which point we will re-establish a timer (per the DNS staleness issue).


src/zookeeper/group.cpp (line 464)
<https://reviews.apache.org/r/42988/#comment178216>

    Comment along the lines of:
    This assertion tests that we only receive a single `reconnecting` callback for the `connected -> disconnected` state transition in the ZooKeeper client.


- Joris Van Remoortere


On Jan. 30, 2016, 1:16 a.m., Neil Conway wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/42988/
> -----------------------------------------------------------
>
> (Updated Jan. 30, 2016, 1:16 a.m.)
>
>
> Review request for mesos and Joris Van Remoortere.
>
>
> Bugs: MESOS-4546
>     https://issues.apache.org/jira/browse/MESOS-4546
>
>
> Repository: mesos
>
>
> Description
> -------
>
> The previous implementation of `GroupProcess` tried to establish a single
> ZooKeeper connection on startup, but didn't attempt to retry. ZooKeeper will
> retry internally, but it only retries by attempting to reconnect to a list of
> previously resolved IPs; it doesn't attempt to re-resolve those IPs to pick up
> updates to DNS configuration. Because DNS configuration can be quite dynamic,
> we now close the current Zk handle and open a new one if we've seen a
> successful `zookeeper_init` but haven't been connected within the ZooKeeper
> session timeout.
>
>
> Diffs
> -----
>
>   src/tests/group_tests.cpp 77349465e0163c8aa6bed6deefe3f98efb442f3d
>   src/zookeeper/group.hpp cf82fec290a2fa9bec122539c2eb0f12b45c2fb2
>   src/zookeeper/group.cpp 2ae3193e0e138c90b205d45400d80e80853e1b99
>   src/zookeeper/zookeeper.cpp 3c4fdad972dcd1728c52a05970646c713dcf98c8
>
> Diff: https://reviews.apache.org/r/42988/diff/
>
>
> Testing
> -------
>
> make check, on both OSX and Arch Linux. Manually configured a situation in
> which the Mesos agent uses stale DNS information in a loop: validated that
> without the patch, we don't pick up DNS changes, whereas with the patch, we do.
>
> Also added a new unit test. Verified that the test fails w/o this patch
> applied and passes deterministically (`gtest_repeat=100`) with the patch
> applied.
>
>
> Thanks,
>
> Neil Conway
>
>
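
For readers who want the idea in the Description in miniature, here is a rough sketch of "close the handle and open a new one if we have not connected within the session timeout". It is not the code from the patch and does not use the ZooKeeper or libprocess APIs: `FakeHandle`, `Resolver`, `Reconnector`, and the integer fake clock are all hypothetical names invented for illustration. The only behaviour it assumes from the Description is that a handle resolves its host once, when it is opened, and that a timer is re-armed on both a fresh open and a `reconnecting` notification.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a ZooKeeper handle. It records the IP that was
// resolved when the handle was opened and never re-resolves it afterwards.
struct FakeHandle {
  std::string resolvedIp;
  bool connected = false;
};

// Hypothetical resolver: returns the IP a hostname currently maps to.
using Resolver = std::function<std::string(const std::string&)>;

// Illustrative only: reopens the handle (and so re-resolves the host) when
// no connection has been established by the deadline.
class Reconnector {
public:
  Reconnector(std::string host, int sessionTimeout, Resolver resolve)
    : host_(std::move(host)),
      sessionTimeout_(sessionTimeout),
      resolve_(std::move(resolve)) {
    open(0);
  }

  // The client reported a successful connection.
  void connected() { handle_->connected = true; }

  // The client reported a lost connection: re-arm the timer.
  void reconnecting(int now) {
    handle_->connected = false;
    deadline_ = now + sessionTimeout_;
  }

  // Advance the fake clock. If the deadline has passed while still
  // disconnected, drop the old handle and open a fresh one.
  void tick(int now) {
    if (!handle_->connected && now >= deadline_) {
      open(now);
    }
  }

  const std::string& resolvedIp() const { return handle_->resolvedIp; }
  int opens() const { return opens_; }

private:
  void open(int now) {
    handle_.reset(new FakeHandle());        // Closes the previous handle.
    handle_->resolvedIp = resolve_(host_);  // Fresh DNS lookup.
    deadline_ = now + sessionTimeout_;      // Timer for this attempt.
    ++opens_;
  }

  std::string host_;
  int sessionTimeout_;
  Resolver resolve_;
  int deadline_ = 0;
  int opens_ = 0;
  std::unique_ptr<FakeHandle> handle_;
};
```

In this sketch, a caller constructs a `Reconnector` with a resolver, then drives it with `tick(now)`; changing what the resolver returns before the deadline expires shows the new IP being picked up only after the handle is reopened. Using a fake clock instead of real timers keeps the example deterministic, which is the same property the unit test in the patch is checked for.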