----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/51278/#review146532 -----------------------------------------------------------
src/slave/containerizer/mesos/linux_launcher.hpp (lines 20 - 22) <https://reviews.apache.org/r/51278/#comment213036> Swap these two. src/slave/containerizer/mesos/linux_launcher.hpp (lines 64 - 85) <https://reviews.apache.org/r/51278/#comment213038> Swap these two definitions. According to google style, type definitions should be listed first. src/slave/containerizer/mesos/linux_launcher.hpp (lines 97 - 99) <https://reviews.apache.org/r/51278/#comment213039> Move this definition up below `recover` helper. src/slave/containerizer/mesos/linux_launcher.cpp (lines 131 - 145) <https://reviews.apache.org/r/51278/#comment213044> I suggest we move this to a helper. We already have a helper to get cgroup from containerId, this will be the reverse operation. ``` string getCgroup(const ContainerID& containerId); Try<ContainerID> getContainerId(const string& cgroup); ``` We can add necessary validation in the helper. src/slave/containerizer/mesos/linux_launcher.cpp (lines 149 - 151) <https://reviews.apache.org/r/51278/#comment213040> As we discussed yesterday, we probably should checkpoint the pid in runtime_dir before creating the cgroup in the freezer hierarchy. src/slave/containerizer/mesos/linux_launcher.cpp (lines 186 - 193) <https://reviews.apache.org/r/51278/#comment213052> The accumulation logic here is quite unintuitive. I'd suggest we make `cgroups::traverse` simple in the sense that it only applies a _map_ function. ``` template <typename T> Try<vector<T>> cgroups::traverse( const string& hierarchy, const string& cgroup, const lambda::function<T(const string&)>& f, const Traversal& traversal = PRE_ORDER); ``` And we perform the accumulation in LinuxLauncher. In that way, `recover` helper can be simplified to just return a `Try<Container>`. src/slave/containerizer/mesos/linux_launcher.cpp (lines 201 - 206) <https://reviews.apache.org/r/51278/#comment213054> I am wondering if should traverse the runtime_dir. Since we checkpoint the pid before we create the cgroup (i.e., we remove cgroup before we remove the checkpointed pid), we might run into the case where the checkpointed pid exists but the freezer cgroup does not exist. However, for legacy containers, we still have to traverse freezer hierarchy. src/slave/containerizer/mesos/linux_launcher.cpp (line 209) <https://reviews.apache.org/r/51278/#comment213071> s/our process/agent src/slave/containerizer/mesos/linux_launcher.cpp (line 223) <https://reviews.apache.org/r/51278/#comment213074> Why do you need this if check? src/slave/containerizer/mesos/linux_launcher.cpp (line 229) <https://reviews.apache.org/r/51278/#comment213073> s/our process/agent src/slave/containerizer/mesos/linux_launcher.cpp (lines 402 - 409) <https://reviews.apache.org/r/51278/#comment213075> When do we remove `Container` from `containers`? We cannot keep them forever as it'll cause a memory leak. I think we should remove it when the top level container is being destroyed, and we should do it _after_ the `cgroups::destory` is successful. src/slave/containerizer/mesos/linux_launcher.cpp (lines 408 - 409) <https://reviews.apache.org/r/51278/#comment213057> Can you be more specific here? Who will be responsible for cleaning up sub-directories under runtime_dir when the top level container terminates? I feel that Launcher should be responsible for doing that because it was the one that checkpoints the data. The caveat here is that we cannot remove checkpoint dir for nested containers until the top level container terminates. This is because the executor might issue `WAIT` call to the agent to get the exit status of the nested container. For the top level container, if we move the entire checkpointed dir during destroy, and if slave crashes before it checkpoints the sentinel file. During recovery, the containerizer will still call `launcher->wait` on the containerId that launcher does not know about. But since the agent does not need the exit code for the executor, this is fine for now. In the future, if we want to add executor status update, we need to adjust the logics here so that we only remove the top level checkpoint dir for the top level container after agent checkpoint the sentinel file. We probably need an explicit `launcher->cleanup`. src/slave/containerizer/mesos/linux_launcher.cpp (lines 414 - 419) <https://reviews.apache.org/r/51278/#comment213055> This sounds like an uncessary optimization that buy us nothing. This also creates an explicit dependency from launcher to pid namespace isolator. I'd suggest we simply just remove this code here (i.e., always use freezer). - Jie Yu On Aug. 22, 2016, 4:54 a.m., Benjamin Hindman wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/51278/ > ----------------------------------------------------------- > > (Updated Aug. 22, 2016, 4:54 a.m.) > > > Review request for mesos and Jie Yu. > > > Repository: mesos > > > Description > ------- > > Note, this doesn't yet include the use of 'ns::enter' for creating > nested containers. > > > Diffs > ----- > > src/slave/containerizer/mesos/linux_launcher.hpp > 8fbe1e9742df85b0fc6de46ac81a0c064c845a63 > src/slave/containerizer/mesos/linux_launcher.cpp > 7377316776646e3d660086da3c0d5b494ce8ace4 > > Diff: https://reviews.apache.org/r/51278/diff/ > > > Testing > ------- > > > Thanks, > > Benjamin Hindman > >