Re: Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67022/ --- (Updated Sept. 13, 2018, 10:17 a.m.) Review request for mesos, Eric Chung and Jason Lai. Changes --- Reset requiredMasterCapabilities.agentUpdate after successful refresh. Repository: mesos Description --- Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`. Diffs (updated) - src/slave/flags.cpp fd53d90776967ae97575140778129d6fddd726d2 src/slave/slave.cpp e6c7e686f287fb4448a0074d4e99298665fc866d Diff: https://reviews.apache.org/r/67022/diff/3/ Changes: https://reviews.apache.org/r/67022/diff/2-3/ Testing --- Thanks, Zhitao Li
Re: Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.
> On Sept. 11, 2018, 10:35 a.m., Benno Evers wrote: > > src/slave/slave.cpp > > Lines 1607 (patched) > > <https://reviews.apache.org/r/67022/diff/1/?file=2018278#file2018278line1607> > > > > If I understand the slave code correctly, I think this would checkpoint > > the new state every time we're losing connection to a master (e.g. when the > > leader changes). > > > > I'd suggest moving it into `Slave::recover()`, i.e. the place where we > > set `requiredMasterCapabilities.agentUpdate` to true in the first place. The condition to set `requiredMasterCapabilities.agentUpdate` is: ``` // Check for SlaveInfo compatibility. Try _compatible = compatible(slaveState->info.get(), info); if (_compatible.isSome()) { // Permitted change, so we reuse the recovered agent id and reconnect // to running executors. // Prior to Mesos 1.5, the master expected that an agent would never // change its `SlaveInfo` and keep the same slave id, and therefore would // not update it's internal data structures on agent re-registration. if (!(slaveState->info.get() == info)) { requiredMasterCapabilities.agentUpdate = true; } ``` So if slave info did not change (i.e, a simple leader change), `requiredMasterCapabilities.agentUpdate` should be `false` and we would not update the checkpointed `SlaveInfo`. I confirmed that in our agent log, but we can also create a new unit test to cover that case. wdyt? - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67022/#review208520 ------- On May 8, 2018, 5:40 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67022/ > --- > > (Updated May 8, 2018, 5:40 p.m.) > > > Review request for mesos, Eric Chung and Jason Lai. > > > Repository: mesos > > > Description > --- > > Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`. > > > Diffs > - > > src/slave/flags.cpp a319b5ea633c41fd8a252c5e1617ac52d1480ba5 > src/slave/slave.cpp c6d9152d9de4184f9107bb8242b41d468d76e018 > > > Diff: https://reviews.apache.org/r/67022/diff/2/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Review Request 68300: Added `gpus` to failure message.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68300/ --- Review request for mesos, Gilbert Song and Jason Lai. Repository: mesos Description --- This message would be propagated all the way to scheduler, so making it explicit that this is related to gpu can speed up debugging. Diffs - src/slave/containerizer/mesos/isolators/gpu/allocator.cpp c288ad634b856702483b9751f41445308babd0c9 Diff: https://reviews.apache.org/r/68300/diff/1/ Testing --- Thanks, Zhitao Li
Review Request 68299: Documented new `--cgroups_destroy_timeout` agent option.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68299/ --- Review request for mesos, Gilbert Song and Jason Lai. Repository: mesos Description --- Documented new `--cgroups_destroy_timeout` agent option. Diffs - docs/configuration/agent.md 83b5fed5a8bf287700688507eaa584f37e8ba2b7 Diff: https://reviews.apache.org/r/68299/diff/1/ Testing --- Thanks, Zhitao Li
Re: Review Request 68088: Replaced `cgroups::DESTROY_TIMEOUT` with new agent flag.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68088/ --- (Updated Aug. 10, 2018, 10:36 a.m.) Review request for mesos, Gilbert Song, Jason Lai, and James Peach. Bugs: MESOS-8038 https://issues.apache.org/jira/browse/MESOS-8038 Repository: mesos Description (updated) --- The new agent flag can be used to reconfigure how long a container destroy is allowed to take on Mesos containerizer. Diffs (updated) - src/linux/cgroups.hpp 6a6f0e6df208bc0b0a888d132b3befd062755851 src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp e016d6dcff15bf49b003c9f3ca94477ce313f7b6 src/slave/containerizer/mesos/linux_launcher.cpp cd677cc652dc53ea3bf9a715b415c3e5c48c1f89 src/slave/flags.hpp eeb9708f9ec76d83b6719541f4a012544c7c0cbe src/slave/flags.cpp 58cdc0f1100fe244e5bf1036e1ccf39478d5d478 Diff: https://reviews.apache.org/r/68088/diff/3/ Changes: https://reviews.apache.org/r/68088/diff/2-3/ Testing --- `make` and `./bin/mesos-slave.sh --help` Thanks, Zhitao Li
Re: Review Request 68088: Replaced `cgroups::DESTROY_TIMEOUT` with new agent flag.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68088/ --- (Updated July 30, 2018, 10:50 a.m.) Review request for mesos, Gilbert Song and Jason Lai. Changes --- Squash both diffs. Summary (updated) - Replaced `cgroups::DESTROY_TIMEOUT` with new agent flag. Bugs: MESOS-8038 https://issues.apache.org/jira/browse/MESOS-8038 Repository: mesos Description (updated) --- The new agent flag can be used to reconfigure how long a container destroy is allowed to take on Mesos containerizer. The default is also increased to 5 min based on suggestion from Gilbert because certain containers could have deep system calls which may not finish within previous 1 min timeout. Diffs (updated) - src/linux/cgroups.hpp 6a6f0e6df208bc0b0a888d132b3befd062755851 src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp 8fba6f9d335cc43a7cd0fcb51ed047ba0e7027ed src/slave/containerizer/mesos/linux_launcher.cpp 3bddcece7028745cec6623ac33dbfcaced629629 src/slave/flags.hpp eeb9708f9ec76d83b6719541f4a012544c7c0cbe src/slave/flags.cpp 58cdc0f1100fe244e5bf1036e1ccf39478d5d478 Diff: https://reviews.apache.org/r/68088/diff/2/ Changes: https://reviews.apache.org/r/68088/diff/1-2/ Testing --- `make` and `./bin/mesos-slave.sh --help` Thanks, Zhitao Li
Review Request 68088: Replaced `cgroups::DESTROY_TIMEOUT` with agent flag.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68088/ --- Review request for mesos, Gilbert Song and Jason Lai. Bugs: MESOS-8038 https://issues.apache.org/jira/browse/MESOS-8038 Repository: mesos Description --- Replaced `cgroups::DESTROY_TIMEOUT` with agent flag. Diffs - src/linux/cgroups.hpp 6a6f0e6df208bc0b0a888d132b3befd062755851 src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp 8fba6f9d335cc43a7cd0fcb51ed047ba0e7027ed src/slave/containerizer/mesos/linux_launcher.cpp 3bddcece7028745cec6623ac33dbfcaced629629 Diff: https://reviews.apache.org/r/68088/diff/1/ Testing --- `make` and `./bin/mesos-slave.sh --help` Thanks, Zhitao Li
Review Request 68087: Added new agent flag `cgroups_destroy_timeout`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68087/ --- Review request for mesos, Gilbert Song and Jason Lai. Bugs: MESOS-8038 https://issues.apache.org/jira/browse/MESOS-8038 Repository: mesos Description --- Added new agent flag `cgroups_destroy_timeout`. Diffs - src/slave/flags.hpp eeb9708f9ec76d83b6719541f4a012544c7c0cbe src/slave/flags.cpp 58cdc0f1100fe244e5bf1036e1ccf39478d5d478 Diff: https://reviews.apache.org/r/68087/diff/1/ Testing --- `make` and run `./bin/mesos-slave.sh --help`. Thanks, Zhitao Li
Re: Review Request 67822: Avoid duplicate unmount dangling mount point.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67822/ --- (Updated July 3, 2018, 4:05 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Minor style fix Bugs: MESOS-9049 https://issues.apache.org/jira/browse/MESOS-9049 Repository: mesos Description --- We could potentially schedule the framework dir, executor dir, and executor run sandbox to gc at the same time, and then these paths will be gc independently, although they are parents and children directories. This patch makes sure we do not call unmount anymore after a success. Diffs (updated) - src/slave/gc.cpp 407f6b23f87cf2e2bdaf873c8adcda57f5d559b3 Diff: https://reviews.apache.org/r/67822/diff/2/ Changes: https://reviews.apache.org/r/67822/diff/1-2/ Testing --- ```make``` Thanks, Zhitao Li
Review Request 67822: Avoid duplicate unmount dangling mount point.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67822/ --- Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- We could potentially schedule the framework dir, executor dir, and executor run sandbox to gc at the same time, and then these paths will be gc independently, although they are parents and children directories. This patch makes sure we do not call unmount anymore after a success. Diffs - src/slave/gc.cpp 407f6b23f87cf2e2bdaf873c8adcda57f5d559b3 Diff: https://reviews.apache.org/r/67822/diff/1/ Testing --- ```make``` Thanks, Zhitao Li
Re: Review Request 53105: Added an hourly timer for `containerizer/docker/image_pull`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53105/ --- (Updated June 20, 2018, 9:49 p.m.) Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang. Changes --- Qian's comment. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description --- Added an hourly timer for `containerizer/docker/image_pull`. Diffs (updated) - src/slave/containerizer/docker.hpp 1ed47c85f765da6fb91a2a243fcc5030e9686190 src/slave/containerizer/docker.cpp f83036aa5a212d1d88e6f5aa41023abc125f7116 Diff: https://reviews.apache.org/r/53105/diff/7/ Changes: https://reviews.apache.org/r/53105/diff/6-7/ Testing --- Ran `mesos-execute` with a docker image and docker containerizer twice, observed the following metrics: ``` curl -s localhost:5051/metrics/snapshot | jq . "containerizer/docker/image_pull_ms": 6628.833024, "containerizer/docker/image_pull_ms/count": 2, "containerizer/docker/image_pull_ms/max": 8851.497216, "containerizer/docker/image_pull_ms/min": 6628.833024, "containerizer/docker/image_pull_ms/p50": 7740.16512, "containerizer/docker/image_pull_ms/p90": 8629.2307968, "containerizer/docker/image_pull_ms/p95": 8740.3640064, "containerizer/docker/image_pull_ms/p99": 8829.27057408, "containerizer/docker/image_pull_ms/p999": 8849.274551808, "containerizer/docker/image_pull_ms/p": 8851.2749495808, ``` Thanks, Zhitao Li
Review Request 67677: Added doc for new pull latency metrics.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67677/ --- Review request for mesos, Jason Lai and Qian Zhang. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description --- Added doc for new pull latency metrics. Diffs - docs/monitoring.md 2985f68f644e84cdc5337c516ee06c95abd6018b Diff: https://reviews.apache.org/r/67677/diff/1/ Testing --- https://github.com/zhitaoli/mesos/blob/public/zhitao/time_containerizer_pulls/docs/monitoring.md#containerizers Thanks, Zhitao Li
Re: Review Request 66875: Added an hourly timer for docker store pull latency.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66875/ --- (Updated June 20, 2018, 11:26 a.m.) Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang. Changes --- Qian's comments. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description --- This patch added pull latency tracking for docker store, which can tell us both latency distribution of pull as well as number of pulls. Diffs (updated) - src/slave/containerizer/mesos/provisioner/docker/store.cpp 6e7dc44321bff3198e5ffe69be1ba329be3ee31e Diff: https://reviews.apache.org/r/66875/diff/3/ Changes: https://reviews.apache.org/r/66875/diff/2-3/ Testing (updated) --- Ran agent in command line and trigger several launches through `mesos-execute`, observed following metrics from agent endpoint: ``` "containerizer/mesos/provisioner/docker_store/image_pull_ms": 3619.53408, "containerizer/mesos/provisioner/docker_store/image_pull_ms/count": 2, "containerizer/mesos/provisioner/docker_store/image_pull_ms/max": 4208.662016, "containerizer/mesos/provisioner/docker_store/image_pull_ms/min": 3619.53408, "containerizer/mesos/provisioner/docker_store/image_pull_ms/p50": 3914.098048, "containerizer/mesos/provisioner/docker_store/image_pull_ms/p90": 4149.7492224, "containerizer/mesos/provisioner/docker_store/image_pull_ms/p95": 4179.2056192, "containerizer/mesos/provisioner/docker_store/image_pull_ms/p99": 4202.77073664, "containerizer/mesos/provisioner/docker_store/image_pull_ms/p999": 4208.072888064, "containerizer/mesos/provisioner/docker_store/image_pull_ms/p": 4208.6031032064, ``` Thanks, Zhitao Li
Re: Review Request 53105: Added an hourly timer for `containerizer/docker/image_pull`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53105/ --- (Updated June 20, 2018, 11:25 a.m.) Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang. Changes --- Qian's comments. Summary (updated) - Added an hourly timer for `containerizer/docker/image_pull`. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description (updated) --- Added an hourly timer for `containerizer/docker/image_pull`. Diffs (updated) - src/slave/containerizer/docker.hpp 1ed47c85f765da6fb91a2a243fcc5030e9686190 src/slave/containerizer/docker.cpp f83036aa5a212d1d88e6f5aa41023abc125f7116 Diff: https://reviews.apache.org/r/53105/diff/6/ Changes: https://reviews.apache.org/r/53105/diff/5-6/ Testing (updated) --- Ran `mesos-execute` with a docker image and docker containerizer twice, observed the following metrics: ``` curl -s localhost:5051/metrics/snapshot | jq . "containerizer/docker/image_pull_ms": 6628.833024, "containerizer/docker/image_pull_ms/count": 2, "containerizer/docker/image_pull_ms/max": 8851.497216, "containerizer/docker/image_pull_ms/min": 6628.833024, "containerizer/docker/image_pull_ms/p50": 7740.16512, "containerizer/docker/image_pull_ms/p90": 8629.2307968, "containerizer/docker/image_pull_ms/p95": 8740.3640064, "containerizer/docker/image_pull_ms/p99": 8829.27057408, "containerizer/docker/image_pull_ms/p999": 8849.274551808, "containerizer/docker/image_pull_ms/p": 8851.2749495808, ``` Thanks, Zhitao Li
Re: Review Request 53105: Added an hourly timer for `slave/docker_containerizer/pull`.
> On June 20, 2018, 12:58 a.m., Qian Zhang wrote: > > src/slave/containerizer/docker.hpp > > Lines 140 (patched) > > <https://reviews.apache.org/r/53105/diff/5/?file=2014848#file2014848line140> > > > > It seems the convention in Mesos is to implement a `struct Metrics` for > > all the metrics rather than directly adding the metric here. And the name > > of the metric should be in underscore_case instead of camelCase, so I would > > suggest to name it `image_pull("containerizer/docker/image_pull", > > Hours(1))`. > > > > And is 1 hour too short for this metric? Ack for name and metrics struct On window: In our operational experience, if something goes wrong on the docker registry side, we typicaly detect that within an hour or so, so I would not make this much longer and lose sensitivity. If anyone needs a longer view, they can usually pipe the metric into another time series database and perform out of bound aggregation from there. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53105/#review205049 ------- On May 21, 2018, 10:16 a.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/53105/ > --- > > (Updated May 21, 2018, 10:16 a.m.) > > > Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang. > > > Bugs: MESOS-6451 > https://issues.apache.org/jira/browse/MESOS-6451 > > > Repository: mesos > > > Description > --- > > Added an hourly timer for `slave/docker_containerizer/pull`. > > > Diffs > - > > src/slave/containerizer/docker.hpp 6bdf4c7f60384473927de437a061fafe94170557 > src/slave/containerizer/docker.cpp a4c9c10e91e75f406329cbb2086f39b549cbeed0 > > > Diff: https://reviews.apache.org/r/53105/diff/5/ > > > Testing > --- > > Ran `mesos-execute` with a docker image and docker containerizer, observed > the following metrics: > ``` > curl -s localhost:5051/metrics/snapshot | jq . | grep pull > "containerizer/docker/pull_ms/p999": 26209.36173824, > "containerizer/docker/pull_ms/p90": 21036.405248, > "containerizer/docker/pull_ms/p": 26256.388615424, > "containerizer/docker/pull_ms/p50": 135.570944, > "containerizer/docker/pull_ms/max": 26261.613824, > "containerizer/docker/pull_ms/p95": 23649.009536, > "containerizer/docker/pull_ms/min": 103.215872, > "containerizer/docker/pull_ms/p99": 25739.0929664, > "containerizer/docker/pull_ms": 26261.613824, > "containerizer/docker/pull_ms/count": 3 > ``` > > > Thanks, > > Zhitao Li > >
Re: Review Request 67575: Changed operator API to notify subscribers on every status change.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67575/#review204840 --- include/mesos/master/master.proto Lines 605-606 (original), 605-609 (patched) <https://reviews.apache.org/r/67575/#comment287580> Please also update comments in `include/mesos/v1/master/master.proto` include/mesos/master/master.proto Lines 606 (patched) <https://reviews.apache.org/r/67575/#comment287581> ``` // This is the latest state of the task according to the agent, // which can be more recent... ``` - Zhitao Li On June 15, 2018, 5:45 a.m., Benno Evers wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67575/ > --- > > (Updated June 15, 2018, 5:45 a.m.) > > > Review request for mesos, Alexander Rukletsov, Anand Mazumdar, Vinod Kone, > and Zhitao Li. > > > Bugs: MESOS-9000 > https://issues.apache.org/jira/browse/MESOS-9000 > > > Repository: mesos > > > Description > --- > > Prior to this change, the master would only send TaskUpdated messages > to subscribers when the latest known task state on the agent changed. > > This implied that schedulers could not reliably wait for the status > information corresponding to specific state updates (i.e. TASK_RUNNING), > since there is no guarantee that subscribers get notified during > the time when this status update will be included in the status field. > > After this change, TaskUpdate messages are sent whenever the latest > acknowledged state of the task changes. > > > Diffs > - > > include/mesos/master/master.proto 54f84120728eea7995422b9c356ed67e5b054623 > src/master/master.cpp 5db5a8da85f02323a5654c93ac47ec4aa7e711d2 > > > Diff: https://reviews.apache.org/r/67575/diff/1/ > > > Testing > --- > > `make check` > > > Thanks, > > Benno Evers > >
Re: Review Request 67423: Skipped metric for non existing paths in gc.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67423/ --- (Updated June 12, 2018, 1:23 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Repository: mesos Description --- Previously, agent gc would increment the "failed" counter if the path does not exist, but this should not be an issue. This patch skipped such paths in both "failed" and "succeeded" counters. Diffs (updated) - src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 Diff: https://reviews.apache.org/r/67423/diff/3/ Changes: https://reviews.apache.org/r/67423/diff/2-3/ Testing --- Thanks, Zhitao Li
Re: Review Request 67421: Rewrote the `ROOT_BusyMountPoint` test to reflect updated behavior.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67421/ --- (Updated June 12, 2018, 1:23 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- The current `ROOT_BusyMountPoint` test would fail because we added support for unmounting dangling mount points in directory to gc. This patch rewrote this test to reflect that after unmounting, the gc succeeded, directory was gone and metrics were correctly reported. Diffs (updated) - src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 Diff: https://reviews.apache.org/r/67421/diff/5/ Changes: https://reviews.apache.org/r/67421/diff/4-5/ Testing --- Thanks, Zhitao Li
Re: Review Request 67264: Unmounted any mount points in gc paths.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67264/ --- (Updated June 12, 2018, 1:22 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Jie's comments. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- In various corner cases, agent may not get chance to properly unmount persistent volumes mounted inside an executor's sandbox. When GC later gets to these sandbox directories, permanent data loss can happen (see MESOS-8830). Currently, the only mounts in the host mount namespace under the sandbox directories are persistent volumes, so this diff added protection to unmount any dangling mount points before calling `rmdir` on the directory. NOTE: this means agent will not garbage collect any path if it cannot read its own `mountinfo` table. Diffs (updated) - src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed Diff: https://reviews.apache.org/r/67264/diff/8/ Changes: https://reviews.apache.org/r/67264/diff/7-8/ Testing --- Added a unit test in following patch. Tested with following procedures: 1. Start a test master and agent; 2. Created a persistent volume on agent through operator API; 3. Use `mesos-execute` to run a task; 4. Stop the agent; 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830); 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately. With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line: ``` W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0' ``` Thanks, Zhitao Li
Re: Review Request 67264: Unmounted any mount points in gc paths.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67264/ --- (Updated June 11, 2018, 4:04 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Use adaptor::reverse Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- In various corner cases, agent may not get chance to properly unmount persistent volumes mounted inside an executor's sandbox. When GC later gets to these sandbox directories, permanent data loss can happen (see MESOS-8830). Currently, the only mounts in the host mount namespace under the sandbox directories are persistent volumes, so this diff added protection to unmount any dangling mount points before calling `rmdir` on the directory. NOTE: this means agent will not garbage collect any path if it cannot read its own `mountinfo` table. Diffs (updated) - src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed Diff: https://reviews.apache.org/r/67264/diff/7/ Changes: https://reviews.apache.org/r/67264/diff/6-7/ Testing --- Added a unit test in following patch. Tested with following procedures: 1. Start a test master and agent; 2. Created a persistent volume on agent through operator API; 3. Use `mesos-execute` to run a task; 4. Stop the agent; 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830); 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately. With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line: ``` W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0' ``` Thanks, Zhitao Li
Re: Review Request 67264: Unmounted any mount points in gc paths.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67264/ --- (Updated June 11, 2018, 1:58 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Fail `rmdir` future if mount table cannot be loaded for agent on Linux. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- In various corner cases, agent may not get chance to properly unmount persistent volumes mounted inside an executor's sandbox. When GC later gets to these sandbox directories, permanent data loss can happen (see MESOS-8830). Currently, the only mounts in the host mount namespace under the sandbox directories are persistent volumes, so this diff added protection to unmount any dangling mount points before calling `rmdir` on the directory. NOTE: this means agent will not garbage collect any path if it cannot read its own `mountinfo` table. Diffs (updated) - src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed Diff: https://reviews.apache.org/r/67264/diff/6/ Changes: https://reviews.apache.org/r/67264/diff/5-6/ Testing --- Added a unit test in following patch. Tested with following procedures: 1. Start a test master and agent; 2. Created a persistent volume on agent through operator API; 3. Use `mesos-execute` to run a task; 4. Stop the agent; 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830); 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately. With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line: ``` W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0' ``` Thanks, Zhitao Li
Re: Review Request 67421: Rewrote the `ROOT_BusyMountPoint` test to reflect updated behavior.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67421/ --- (Updated June 5, 2018, 5:08 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Updated test cases. Summary (updated) - Rewrote the `ROOT_BusyMountPoint` test to reflect updated behavior. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description (updated) --- The current `ROOT_BusyMountPoint` test would fail because we added support for unmounting dangling mount points in directory to gc. This patch rewrote this test to reflect that after unmounting, the gc succeeded, directory was gone and metrics were correctly reported. Diffs (updated) - src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 Diff: https://reviews.apache.org/r/67421/diff/3/ Changes: https://reviews.apache.org/r/67421/diff/2-3/ Testing --- Thanks, Zhitao Li
Re: Review Request 67264: Unmounted any mount points in gc paths.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67264/ --- (Updated June 5, 2018, 5:07 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Jie's comments. Summary (updated) - Unmounted any mount points in gc paths. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description (updated) --- In various corner cases, agent may not get chance to properly unmount persistent volumes mounted inside an executor's sandbox. When GC later gets to these sandbox directories, permanent data loss can happen (see MESOS-8830). Currently, the only mounts in the host mount namespace under the sandbox directories are persistent volumes, so this diff added protection to unmount any dangling mount points before calling `rmdir` on the directory. NOTE: this means agent will not garbage collect any path if it cannot read its own `mountinfo` table. Diffs (updated) - src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed Diff: https://reviews.apache.org/r/67264/diff/5/ Changes: https://reviews.apache.org/r/67264/diff/4-5/ Testing --- Added a unit test in following patch. Tested with following procedures: 1. Start a test master and agent; 2. Created a persistent volume on agent through operator API; 3. Use `mesos-execute` to run a task; 4. Stop the agent; 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830); 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately. With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line: ``` W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0' ``` Thanks, Zhitao Li
Re: Review Request 67421: Added a test for verifying GC can handle dangling persistent volume.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67421/ --- (Updated June 1, 2018, 10:31 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Fix test name and failed counter value Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- Added a test for verifying GC can handle dangling persistent volume. Diffs (updated) - src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 Diff: https://reviews.apache.org/r/67421/diff/2/ Changes: https://reviews.apache.org/r/67421/diff/1-2/ Testing --- Thanks, Zhitao Li
Review Request 67423: Skipped metric for non existing paths in gc.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67423/ --- Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Repository: mesos Description --- Previously, agent gc would increment the "failed" counter if the path does not exist, but this should not be an issue. This patch skipped such paths in both "failed" and "succeeded" counters. Diffs - src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 Diff: https://reviews.apache.org/r/67423/diff/1/ Testing --- Thanks, Zhitao Li
Review Request 67421: Added a test for verifying GC can handle dangling persistent volume.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67421/ --- Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- Added a test for verifying GC can handle dangling persistent volume. Diffs - src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 Diff: https://reviews.apache.org/r/67421/diff/1/ Testing --- Thanks, Zhitao Li
Re: Review Request 67264: Unmounted any dangling persistent volume in gc paths.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67264/ --- (Updated May 31, 2018, 1:38 p.m.) Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu. Changes --- Jason's comments. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- In various corner cases, agent may not get chance to properly unmount persistent volumes mounted inside an executor's sandbox. When GC later gets to these sandbox directories, permanent data loss can happen (see MESOS-8830). This patch added some protection to unmount possible persistent volumes inside a path to gc, and skipped the path if unmount failed. NOTE: this means agent will not garbage collect any path if it cannot read its own `mountinfo` table. Diffs (updated) - src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 src/tests/cluster.cpp b56212f6529a4d307e65797ad9bb34f2104fc832 src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 src/tests/slave_tests.cpp 65d860594572b58a50a89358e31e97fd2a10bf08 Diff: https://reviews.apache.org/r/67264/diff/3/ Changes: https://reviews.apache.org/r/67264/diff/2-3/ Testing --- Tested with following procedures: 1. Start a test master and agent; 2. Created a persistent volume on agent through operator API; 3. Use `mesos-execute` to run a task; 4. Stop the agent; 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830); 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately. With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line: ``` W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0' ``` Thanks, Zhitao Li
Re: Review Request 67320: Sent task (health) check updates over the operator streaming API.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67320/#review204023 --- Ship it! Ship It! - Zhitao Li On May 28, 2018, 4:25 a.m., Alexander Rukletsov wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67320/ > --- > > (Updated May 28, 2018, 4:25 a.m.) > > > Review request for mesos and Zhitao Li. > > > Bugs: MESOS-8942 > https://issues.apache.org/jira/browse/MESOS-8942 > > > Repository: mesos > > > Description > --- > > With this patch subscribers to the master operator streaming API > start receiving task health (check) updates. This allows subscribers > to maintain more accurate view of the cluster's state, closer to > what the traditional `state.json` endpoint offers. > > > Diffs > - > > src/master/master.cpp 41862db9900acde85a62d2fea85459691c68556e > src/tests/api_tests.cpp 84368707e2c0bcf66bbfb308a4b863112119d328 > > > Diff: https://reviews.apache.org/r/67320/diff/2/ > > > Testing > --- > > make check on > * Mac OS 10.13.4 > * various linux distros > > > Thanks, > > Alexander Rukletsov > >
Re: Review Request 67320: Sent task (health) check updates over the operator streaming API.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67320/#review203885 --- src/tests/api_tests.cpp Lines 2386 (patched) <https://reviews.apache.org/r/67320/#comment286238> We typically do not do `this->`. `Try> master = StartMaster();` - Zhitao Li On May 25, 2018, 6:52 a.m., Alexander Rukletsov wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67320/ > --- > > (Updated May 25, 2018, 6:52 a.m.) > > > Review request for mesos and Zhitao Li. > > > Bugs: MESOS-8942 > https://issues.apache.org/jira/browse/MESOS-8942 > > > Repository: mesos > > > Description > --- > > With this patch subscribers to the master operator streaming API > start receiving task health (check) updates. This allows subscribers > to maintain more accurate view of the cluster's state, closer to > what the traditional `state.json` endpoint offers. > > > Diffs > - > > src/master/master.cpp 41862db9900acde85a62d2fea85459691c68556e > src/tests/api_tests.cpp 84368707e2c0bcf66bbfb308a4b863112119d328 > > > Diff: https://reviews.apache.org/r/67320/diff/1/ > > > Testing > --- > > make check on > * Mac OS 10.13.4 > * various linux distros > > > Thanks, > > Alexander Rukletsov > >
Re: Review Request 67264: Unmounted any dangling persistent volume in gc paths.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67264/ --- (Updated May 24, 2018, 12:48 p.m.) Review request for mesos, Jason Lai and Jie Yu. Changes --- Fix test compilation due to changed constructor interface. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- In various corner cases, agent may not get chance to properly unmount persistent volumes mounted inside an executor's sandbox. When GC later gets to these sandbox directories, permanent data loss can happen (see MESOS-8830). This patch added some protection to unmount possible persistent volumes inside a path to gc, and skipped the path if unmount failed. NOTE: this means agent will not garbage collect any path if it cannot read its own `mountinfo` table. Diffs (updated) - src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 src/tests/cluster.cpp b56212f6529a4d307e65797ad9bb34f2104fc832 src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 src/tests/slave_tests.cpp 65d860594572b58a50a89358e31e97fd2a10bf08 Diff: https://reviews.apache.org/r/67264/diff/2/ Changes: https://reviews.apache.org/r/67264/diff/1-2/ Testing --- Tested with following procedures: 1. Start a test master and agent; 2. Created a persistent volume on agent through operator API; 3. Use `mesos-execute` to run a task; 4. Stop the agent; 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830); 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately. With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line: ``` W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0' ``` Thanks, Zhitao Li
Review Request 67264: Unmounted any dangling persistent volume in gc paths.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67264/ --- Review request for mesos, Jason Lai and Jie Yu. Bugs: MESOS-8830 https://issues.apache.org/jira/browse/MESOS-8830 Repository: mesos Description --- In various corner cases, agent may not get chance to properly unmount persistent volumes mounted inside an executor's sandbox. When GC later gets to these sandbox directories, permanent data loss can happen (see MESOS-8830). This patch added some protection to unmount possible persistent volumes inside a path to gc, and skipped the path if unmount failed. NOTE: this means agent will not garbage collect any path if it cannot read its own `mountinfo` table. Diffs - src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 Diff: https://reviews.apache.org/r/67264/diff/1/ Testing --- Tested with following procedures: 1. Start a test master and agent; 2. Created a persistent volume on agent through operator API; 3. Use `mesos-execute` to run a task; 4. Stop the agent; 5. Manually bind mount persistent volume path into a `volume` directory inside the executor sandbox (to simulate a dangling mount in MESOS-8830); 6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc the path immediately. With this fix, we observed that the dangling mount is automatically cleaned up, and agent produces log line: ``` W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume' of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' inside garbage collected path '/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0' ``` Thanks, Zhitao Li
Re: Review Request 67154: Replaced `RpcResult` with `Try`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67154/#review203271 --- 3rdparty/libprocess/include/process/grpc.hpp Lines 122 (patched) <https://reviews.apache.org/r/67154/#comment285376> This function returns a `Future` of a `Try` ... 3rdparty/libprocess/include/process/grpc.hpp Lines 125-126 (patched) <https://reviews.apache.org/r/67154/#comment285377> Do we need to call out what case the `Future` itself can be failed? - Zhitao Li On May 16, 2018, 12:20 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67154/ > --- > > (Updated May 16, 2018, 12:20 p.m.) > > > Review request for mesos, Benjamin Bannier, Benjamin Mahler, Jie Yu, and > Zhitao Li. > > > Bugs: MESOS-8924 > https://issues.apache.org/jira/browse/MESOS-8924 > > > Repository: mesos > > > Description > --- > > The `process::grpc::client::Runtime::call` method currently returns a > `RpcResult`, which contains both a `::grpc::Status` object > and the resulting response protobuf. However, if the `::grpc::Status` > represents a non-OK status, the gRPC library does not guarantee that > the response protobuf is valid. This patch replaces `RpcResult` with > `Try` to provide better type safety. > > > Diffs > - > > 3rdparty/libprocess/include/process/grpc.hpp > 321a46e19c69eafb24012bcef68bb8b0cc6aa436 > 3rdparty/libprocess/src/tests/grpc_tests.cpp > 38cd6c61b54518a1019bb11a3551be13026c3f0d > > > Diff: https://reviews.apache.org/r/67154/diff/1/ > > > Testing > --- > > make check in libprocess > > NOTE: Mesos cannot be built with this patch standalone. The tests are done > later in the chain. > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67022/ --- (Updated May 8, 2018, 5:40 p.m.) Review request for mesos, Eric Chung and Jason Lai. Repository: mesos Description --- Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`. Diffs (updated) - src/slave/flags.cpp a319b5ea633c41fd8a252c5e1617ac52d1480ba5 src/slave/slave.cpp c6d9152d9de4184f9107bb8242b41d468d76e018 Diff: https://reviews.apache.org/r/67022/diff/2/ Changes: https://reviews.apache.org/r/67022/diff/1-2/ Testing --- Thanks, Zhitao Li
Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67022/ --- Review request for mesos, Benno Evers, Jason Lai, and Vinod Kone. Repository: mesos Description --- Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`. Diffs - src/slave/slave.cpp c6d9152d9de4184f9107bb8242b41d468d76e018 Diff: https://reviews.apache.org/r/67022/diff/1/ Testing --- Thanks, Zhitao Li
Re: Review Request 64384: Added new 'any' setting for reconfiguration_policy flag.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64384/#review202713 --- src/slave/slave.cpp Lines 6373-6375 (original), 6377-6379 (patched) <https://reviews.apache.org/r/64384/#comment284699> I noticed that we do not refresh the checkpointed information of agent, but simply proceed to run a different `AgentInfo` from what's left on the disk. Do we worry that this could cause a confusion in the future? I wonder whether we should augment the behavior for `any` to also flush out changed `AgentInfo` using `state::checkpoint()`. Thoughts? - Zhitao Li On Dec. 6, 2017, 8:22 a.m., Benno Evers wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64384/ > --- > > (Updated Dec. 6, 2017, 8:22 a.m.) > > > Review request for mesos. > > > Repository: mesos > > > Description > --- > > This setting allows any state change, effectively telling agents to ignore > the existing state and not to run any new tasks until the existing ones fit > into the new provided resources. > > > Diffs > - > > src/slave/compatibility.hpp 78b421a01abe5d2178c93832577577a7ba282b38 > src/slave/compatibility.cpp 4ead4a5b655f6f3d7812aa52d656830d7cff4598 > src/slave/flags.cpp d8764745e6aca81283d8b96388df1320c3465952 > src/slave/slave.cpp 49270013537356c8fe9150d757b064bc3bbae3cb > > > Diff: https://reviews.apache.org/r/64384/diff/1/ > > > Testing > --- > > > Thanks, > > Benno Evers > >
Re: Review Request 66993: Removed unnecessary expectation on termination.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66993/ --- (Updated May 8, 2018, 1:36 p.m.) Review request for mesos, Andrei Budnik and James Peach. Bugs: MESOS-8884 https://issues.apache.org/jira/browse/MESOS-8884 Repository: mesos Description --- This test was flaky because termination could already happened when we set up the expectation. Given that we already verified task state, I do not see checking container termination explicitly is necessary, so removing the expectation should fix the flakiness. Diffs (updated) - src/tests/containerizer/docker_containerizer_tests.cpp d834e531a550028cd57ac409c9312dd22138e8d5 Diff: https://reviews.apache.org/r/66993/diff/2/ Changes: https://reviews.apache.org/r/66993/diff/1-2/ Testing --- GTEST_FILTER="DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime" ./bin/mesos-tests.sh --gtest_repeat=100 --verbose Thanks, Zhitao Li
Review Request 66993: Removed unnecessary expectation on termination.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66993/ --- Review request for mesos, Andrei Budnik and James Peach. Bugs: MESOS-8884 https://issues.apache.org/jira/browse/MESOS-8884 Repository: mesos Description --- This test was flaky because termination could already happened when we set up the expectation. Given that we already verified task state, I do not see checking container termination explicitly is necessary, so removing the expectation should fix the flakiness. Diffs - src/tests/containerizer/docker_containerizer_tests.cpp d834e531a550028cd57ac409c9312dd22138e8d5 Diff: https://reviews.apache.org/r/66993/diff/1/ Testing --- GTEST_FILTER="DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime" ./bin/mesos-tests.sh --gtest_repeat=100 --verbose Thanks, Zhitao Li
Re: Review Request 66923: Added documentation on volume resize support.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66923/ --- (Updated May 4, 2018, 9:49 a.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Changes --- Review comments. Repository: mesos Description --- Added documentation on volume resize support. Diffs (updated) - docs/authorization.md fdbef770c38ab38ab748b4f943b331d4442a2ce1 docs/operator-http-api.md 9be1e2db60562b369bdc6ee2c732a74fad049580 docs/persistent-volume.md 1a5799b8d8e5302e0b01da6b9a16a3b9fb75c898 Diff: https://reviews.apache.org/r/66923/diff/2/ Changes: https://reviews.apache.org/r/66923/diff/1-2/ Testing --- https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/persistent-volume.md https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/operator-http-api.md https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/authorization.md Thanks, Zhitao Li
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
> On May 2, 2018, 11:51 a.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 1289 (patched) > > <https://reviews.apache.org/r/66532/diff/8/?file=2015784#file2015784line1289> > > > > I don't think we need this but it doesn't hurt. I'll leave it up to you > > to decide if you want to keep this. Please feel free to drop it if you > > prefer keeping it. > > > > This is more of a personal preference ;) I prefer always introducing > > synchronizations for a reason but some others prefer making tests as > > deterministic as possible. I think I prefer keeping it for now. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/#review202300 ------- On May 1, 2018, 3:55 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66532/ > --- > > (Updated May 1, 2018, 3:55 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-8748 > https://issues.apache.org/jira/browse/MESOS-8748 > > > Repository: mesos > > > Description > --- > > Added test for authorization actions for `RESIZE_VOLUME`. > > > Diffs > - > > src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66532/diff/8/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66920: Improved tests for resizing persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66920/#review202355 --- src/tests/persistent_volume_tests.cpp Lines 604 (patched) <https://reviews.apache.org/r/66920/#comment284139> Technically we only tested that persistent volume and its data can still be accessed. I do not know a reliable and efficient way to test the volume size actually changed after resizing. Maybe clarify the comment? src/tests/persistent_volume_tests.cpp Line 751 (original), 807 (patched) <https://reviews.apache.org/r/66920/#comment284140> ``` This test verifies that launching any task depending either origial or grown volume of a `GROW_VOLUME` call in the same `acceptOffers` will be dropped... ``` src/tests/persistent_volume_tests.cpp Line 874 (original), 942 (patched) <https://reviews.apache.org/r/66920/#comment284141> ``` This test verifies that launching any task depending either origial or shrunk volume of a `SHRINK_VOLUME` call in the same `acceptOffers` will be dropped... ``` - Zhitao Li On May 2, 2018, 9:54 p.m., Chun-Hung Hsiao wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66920/ > --- > > (Updated May 2, 2018, 9:54 p.m.) > > > Review request for mesos, Greg Mann and Zhitao Li. > > > Repository: mesos > > > Description > --- > > Now the `GrowVolume` and `ShrinkVolume` tests launch tasks after > resizing the volumes to ensure that the operations take effect on > agents. The `NonSpeculativeGrowAndLaunch` and > `NonSpeculativeShrinkAndLaunch` tests launch an additional task to > verify that the original volume consumed by the operations cannot be > used by subsequent tasks. > > This patch also adjusted when the clock is resumed so schedulers will > not receive unexpected offers. > > > Diffs > - > > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66920/diff/2/ > > > Testing > --- > > sudo make check > > > Thanks, > > Chun-Hung Hsiao > >
Re: Review Request 66227: Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66227/ --- (Updated May 2, 2018, 5:06 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Run test on Windows too. Bugs: MESOS-8747 https://issues.apache.org/jira/browse/MESOS-8747 Repository: mesos Description --- Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API. Diffs (updated) - src/tests/api_tests.cpp dd8e221d8fd1b2a241505345337897e4ee4a6347 Diff: https://reviews.apache.org/r/66227/diff/9/ Changes: https://reviews.apache.org/r/66227/diff/8-9/ Testing --- Thanks, Zhitao Li
Review Request 66923: Added documentation on volume resize support.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66923/ --- Review request for mesos, Chun-Hung Hsiao and Greg Mann. Repository: mesos Description --- Added documentation on volume resize support. Diffs - docs/authorization.md fdbef770c38ab38ab748b4f943b331d4442a2ce1 docs/operator-http-api.md 9be1e2db60562b369bdc6ee2c732a74fad049580 docs/persistent-volume.md 1a5799b8d8e5302e0b01da6b9a16a3b9fb75c898 Diff: https://reviews.apache.org/r/66923/diff/1/ Testing --- https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/persistent-volume.md https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/operator-http-api.md https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/authorization.md Thanks, Zhitao Li
Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66858/ --- (Updated May 2, 2018, 2:16 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Bugs: MESOS-4945 https://issues.apache.org/jira/browse/MESOS-4945 Repository: mesos Description --- Added tests for validation of `GrowVolume` and `ShrinkVolume`. Diffs (updated) - src/tests/master_validation_tests.cpp a5229610a81dfac9d358597135b63ee51de86659 Diff: https://reviews.apache.org/r/66858/diff/6/ Changes: https://reviews.apache.org/r/66858/diff/5-6/ Testing --- Thanks, Zhitao Li
Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66858/ --- (Updated May 2, 2018, 2:14 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Bugs: MESOS-4945 https://issues.apache.org/jira/browse/MESOS-4945 Repository: mesos Description --- Added tests for validation of `GrowVolume` and `ShrinkVolume`. Diffs (updated) - src/tests/master_validation_tests.cpp a5229610a81dfac9d358597135b63ee51de86659 Diff: https://reviews.apache.org/r/66858/diff/5/ Changes: https://reviews.apache.org/r/66858/diff/4-5/ Testing --- Thanks, Zhitao Li
Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66858/ --- (Updated May 2, 2018, 12:01 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Changes --- Added test for grow on MOUNT. Bugs: MESOS-4945 https://issues.apache.org/jira/browse/MESOS-4945 Repository: mesos Description --- Added tests for validation of `GrowVolume` and `ShrinkVolume`. Diffs (updated) - src/tests/master_validation_tests.cpp a5229610a81dfac9d358597135b63ee51de86659 Diff: https://reviews.apache.org/r/66858/diff/4/ Changes: https://reviews.apache.org/r/66858/diff/3-4/ Testing --- Thanks, Zhitao Li
Re: Review Request 66227: Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66227/ --- (Updated May 2, 2018, 10:28 a.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Review comments. Bugs: MESOS-8747 https://issues.apache.org/jira/browse/MESOS-8747 Repository: mesos Description --- Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API. Diffs (updated) - src/tests/api_tests.cpp dd8e221d8fd1b2a241505345337897e4ee4a6347 Diff: https://reviews.apache.org/r/66227/diff/7/ Changes: https://reviews.apache.org/r/66227/diff/6-7/ Testing --- Thanks, Zhitao Li
Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.
> On May 1, 2018, 7:18 p.m., Chun-Hung Hsiao wrote: > > src/tests/master_validation_tests.cpp > > Lines 1657 (patched) > > <https://reviews.apache.org/r/66858/diff/2/?file=2015777#file2015777line1657> > > > > Hmm... we don't have this check when validating `GrowVolume` because we > > are counting on the fact that adding anything disk to a MOUNT disk will > > result in two resources. But I guess for the validation test maybe it's a > > good idea to also test this for `GrowVolume`. What do you think? MOUNT would actually pass here in current implementation. As we discussed, I think it may be a good idea to explicitly validate this then test it out. Let me know what you think. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66858/#review202251 ----------- On May 2, 2018, 10:24 a.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66858/ > --- > > (Updated May 2, 2018, 10:24 a.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-4945 > https://issues.apache.org/jira/browse/MESOS-4945 > > > Repository: mesos > > > Description > --- > > Added tests for validation of `GrowVolume` and `ShrinkVolume`. > > > Diffs > - > > src/tests/master_validation_tests.cpp > a5229610a81dfac9d358597135b63ee51de86659 > > > Diff: https://reviews.apache.org/r/66858/diff/3/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66858/ --- (Updated May 2, 2018, 10:24 a.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Changes --- Rebase and review comments. Bugs: MESOS-4945 https://issues.apache.org/jira/browse/MESOS-4945 Repository: mesos Description --- Added tests for validation of `GrowVolume` and `ShrinkVolume`. Diffs (updated) - src/tests/master_validation_tests.cpp a5229610a81dfac9d358597135b63ee51de86659 Diff: https://reviews.apache.org/r/66858/diff/3/ Changes: https://reviews.apache.org/r/66858/diff/2-3/ Testing --- Thanks, Zhitao Li
Re: Review Request 66051: Implemented operator API to grow and shrink persistent volume.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66051/ --- (Updated May 1, 2018, 3:56 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Review comments. Bugs: MESOS-8747 https://issues.apache.org/jira/browse/MESOS-8747 Repository: mesos Description --- These operator APIs is implemented as speculative for now, but we plan to convert them to non-speculative in the future. Diffs (updated) - src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad src/master/master.hpp c30cf082d9ddf2e5dd97e0d887d6d07d8e03cbd4 src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d Diff: https://reviews.apache.org/r/66051/diff/16/ Changes: https://reviews.apache.org/r/66051/diff/15-16/ Testing --- Thanks, Zhitao Li
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/#review202238 --- src/tests/persistent_volume_tests.cpp Line 1212 (original), 1133 (patched) <https://reviews.apache.org/r/66532/#comment283990> For my self: consistency. - Zhitao Li On May 1, 2018, 3:45 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66532/ > --- > > (Updated May 1, 2018, 3:45 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-8748 > https://issues.apache.org/jira/browse/MESOS-8748 > > > Repository: mesos > > > Description > --- > > Added test for authorization actions for `RESIZE_VOLUME`. > > > Diffs > - > > src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66532/diff/8/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/ --- (Updated May 1, 2018, 3:55 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Changes --- Several more review comments. Bugs: MESOS-8748 https://issues.apache.org/jira/browse/MESOS-8748 Repository: mesos Description --- Added test for authorization actions for `RESIZE_VOLUME`. Diffs (updated) - src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a Diff: https://reviews.apache.org/r/66532/diff/8/ Changes: https://reviews.apache.org/r/66532/diff/7-8/ Testing --- Thanks, Zhitao Li
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/ --- (Updated May 1, 2018, 3:44 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Several more review comments. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations. Diffs (updated) - src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a Diff: https://reviews.apache.org/r/66220/diff/10/ Changes: https://reviews.apache.org/r/66220/diff/9-10/ Testing --- Thanks, Zhitao Li
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/ --- (Updated May 1, 2018, 3:45 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Changes --- Review comments. Bugs: MESOS-8748 https://issues.apache.org/jira/browse/MESOS-8748 Repository: mesos Description --- Added test for authorization actions for `RESIZE_VOLUME`. Diffs (updated) - src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a Diff: https://reviews.apache.org/r/66532/diff/7/ Changes: https://reviews.apache.org/r/66532/diff/6-7/ Testing --- Thanks, Zhitao Li
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/ --- (Updated May 1, 2018, 3:37 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Review comments. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations. Diffs (updated) - src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a Diff: https://reviews.apache.org/r/66220/diff/9/ Changes: https://reviews.apache.org/r/66220/diff/8-9/ Testing --- Thanks, Zhitao Li
Re: Review Request 66050: Implemented grow and shrink of persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66050/ --- (Updated May 1, 2018, 3:35 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Review comments. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- The new offer operations are implemented as speculative operations, but we will use validation to make them non-speculative on API level so that we can transition later without a breaking change. Diffs (updated) - src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d Diff: https://reviews.apache.org/r/66050/diff/15/ Changes: https://reviews.apache.org/r/66050/diff/14-15/ Testing --- Thanks, Zhitao Li
Re: Review Request 66049: Added offer operation to grow and shrink persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66049/ --- (Updated May 1, 2018, 3:34 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Add TODO. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- Added offer operation to grow and shrink persistent volumes. Diffs (updated) - include/mesos/mesos.proto 5bc4a80ad5278d20feced73dce755519515de505 include/mesos/v1/mesos.proto 5a4e733250831fa5fe86544aeb98dbc0a4f5afa6 src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/resource_provider/storage/provider.cpp d1267cf8df06cc758c47d47535b868a3ac741a0b src/tests/mesos.hpp 756a52189ea3336b203ea1d606e2ba17762d57aa src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a src/tests/reservation_tests.cpp 5570df2e0b208512d0a0a3079a180e1acfe08f3d Diff: https://reviews.apache.org/r/66049/diff/13/ Changes: https://reviews.apache.org/r/66049/diff/12-13/ Testing --- Thanks, Zhitao Li
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
> On April 30, 2018, 9:05 p.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 1256 (patched) > > <https://reviews.apache.org/r/66532/diff/6/?file=2014384#file2014384line1256> > > > > Suggestion: Start the first framework with `DEFAULT_CREDENTIAL`, to be > > consistent with your comment for the second framework. This is implicit by using `DEFAULT_FRAMEWORK_INFO`. Are you suggesting to call set_principal() explicitly? - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/#review202170 ----------- On April 27, 2018, 1:08 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66532/ > --- > > (Updated April 27, 2018, 1:08 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-8748 > https://issues.apache.org/jira/browse/MESOS-8748 > > > Repository: mesos > > > Description > --- > > Added test for authorization actions for `RESIZE_VOLUME`. > > > Diffs > - > > src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66532/diff/6/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
> On April 30, 2018, 9:05 p.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 1464 (patched) > > <https://reviews.apache.org/r/66532/diff/6/?file=2014384#file2014384line1464> > > > > No need to resume the clock here, as the fixture teardown will resume > > it. However, I'm not sure if you need to resume the clock before stopping > > and joining `driver2`. I think enough tests explicitly resume the clock, so I'd like to keep it unless you have issue with it. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/#review202170 ----------- On April 27, 2018, 1:08 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66532/ > --- > > (Updated April 27, 2018, 1:08 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-8748 > https://issues.apache.org/jira/browse/MESOS-8748 > > > Repository: mesos > > > Description > --- > > Added test for authorization actions for `RESIZE_VOLUME`. > > > Diffs > - > > src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66532/diff/6/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> On April 30, 2018, 5:20 p.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 459 (patched) > > <https://reviews.apache.org/r/66220/diff/8/?file=2014377#file2014377line459> > > > > Suggestion for the TODO: Make `MOUNT` a meaningful parameter value for > > this test, or create a new fixture to avoid testing against it. > > > > BTW, would it be hard to do what you did for `MOUNT` in the > > `ShrinkVolume` test? Yes it is harder here because there is not conceivable way for a framework to receive both a `MOUNT` volume and free disk resource on the same `MOUNT`, so framework cannot even construct such operation from offered resources. > On April 30, 2018, 5:20 p.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 837 (patched) > > <https://reviews.apache.org/r/66220/diff/8/?file=2014377#file2014377line837> > > > > To make this test more succinct, how about doing > > `{CREATE(volume), GROW_VOLUME(volume, addition), LAUNCH(task)}` so we > > don't need the subsequent `acceptOffers`? I like this suggestion. will try it out. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/#review202157 --- On April 27, 2018, 1:04 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66220/ > --- > > (Updated April 27, 2018, 1:04 p.m.) > > > Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. > > > Bugs: MESOS-4965 > https://issues.apache.org/jira/browse/MESOS-4965 > > > Repository: mesos > > > Description > --- > > Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations. > > > Diffs > - > > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66220/diff/8/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66293: Tested default executor support of `max_completion_time`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66293/ --- (Updated May 1, 2018, 9:43 a.m.) Review request for mesos, Jason Lai and James Peach. Changes --- Review comments. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- Tested default executor support of `max_completion_time`. Diffs (updated) - src/tests/default_executor_tests.cpp bf849c4b636e81ec267112bff9621579998941f5 Diff: https://reviews.apache.org/r/66293/diff/6/ Changes: https://reviews.apache.org/r/66293/diff/5-6/ Testing --- Thanks, Zhitao Li
Re: Review Request 66284: Tested `max_completion_time` support in docker executor.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66284/ --- (Updated May 1, 2018, 9:42 a.m.) Review request for mesos, Jason Lai and James Peach. Changes --- Review comments. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- Tested `max_completion_time` support in docker executor. Diffs (updated) - src/tests/containerizer/docker_containerizer_tests.cpp 5e3dfdb537f52a85a82c6e8195e70a389b76aadf Diff: https://reviews.apache.org/r/66284/diff/5/ Changes: https://reviews.apache.org/r/66284/diff/4-5/ Testing --- Thanks, Zhitao Li
Re: Review Request 66291: Added support to `max_completion_time` in default executor.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66291/ --- (Updated May 1, 2018, 9:42 a.m.) Review request for mesos, Jason Lai and James Peach. Changes --- Review comments. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- When a task group has multiple tasks: - each task can have its own `max_completion_time`, or not have one; - if a task succeeds before its `max_completion_time`, all other tasks will keep running; - if a task fails, all other tasks in the same group will fail (as before); - if a task does not succeed before its `max_completion_time`, it will fail with `TASK_FAILED` and reason `REASON_MAX_COMPLETION_TIME_REACHED`, while other tasks will be killed without above reason. Diffs (updated) - src/launcher/default_executor.cpp ea0d425bb60e970f209f411632e1d486c279259b Diff: https://reviews.apache.org/r/66291/diff/4/ Changes: https://reviews.apache.org/r/66291/diff/3-4/ Testing --- Thanks, Zhitao Li
Re: Review Request 66283: Added support of `max_completion_time` in docker executor.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66283/ --- (Updated May 1, 2018, 9:42 a.m.) Review request for mesos, Jason Lai and James Peach. Changes --- Review comments. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- If `TaskInfo.max_completion_time` is set, docker executor will kill the task immediately. We reuse the `shutdown` method to achieve a forced kill ignoring any `KillPolicy`. Framework should only received a `TASK_FAILED` state with `REASON_MAX_COMPLETION_TIME_REACHED` reason. Diffs (updated) - src/docker/executor.cpp 1d672112bae889cc9d19343a59e0ff66569785c4 Diff: https://reviews.apache.org/r/66283/diff/5/ Changes: https://reviews.apache.org/r/66283/diff/4-5/ Testing --- Thanks, Zhitao Li
Re: Review Request 66259: Added `max_completion_time` support to command executor.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66259/ --- (Updated May 1, 2018, 9:41 a.m.) Review request for mesos, Jason Lai and James Peach. Changes --- Review comments. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- If `TaskInfo.max_completion_time` is set, command executor will kill the task with `SIGKILL` immediately. Note that no KillPolicy will be observed. Framework should only received a `TASK_FAILED` state with `REASON_MAX_COMPLETION_TIME_REACHED` reason. Diffs (updated) - src/launcher/executor.cpp 8d0869cfcdfc693301d0e185a036e97204bad17f Diff: https://reviews.apache.org/r/66259/diff/6/ Changes: https://reviews.apache.org/r/66259/diff/5-6/ Testing --- Thanks, Zhitao Li
Re: Review Request 66828: Introduced a push-based gauge metric.
> On May 1, 2018, 9:25 a.m., Zhitao Li wrote: > > 3rdparty/libprocess/include/process/metrics/push_gauge.hpp > > Lines 70-75 (patched) > > <https://reviews.apache.org/r/66828/diff/1/?file=2013207#file2013207line70> > > > > Hmm, is there a possible race condition here? > > > > Consider two threads calling this with `v=2` and `v=3`, and initial > > value as `0`, and the order would be: > > > > T1: `fetch_add(2)`, `prev` in T1 == 0; > > T2: `fetch_add(3)`, `prev` in T2 == 2; > > T2: `push(2 + 3)`, metric value is 5; > > T1: `push(0 + 2)`, metric value is 2. > > > > Metric value and cached value in gauge would be out of sync due to the > > race condition? My bad, realized that `push()` is only used as history. Actual sampling happens through `value()` function. Please ignore my comment. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66828/#review202194 --- On April 26, 2018, 2:54 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66828/ > --- > > (Updated April 26, 2018, 2:54 p.m.) > > > Review request for mesos, Gaston Kleiman, Gilbert Song, Greg Mann, James > Peach, and Vinod Kone. > > > Bugs: MESOS-8851 > https://issues.apache.org/jira/browse/MESOS-8851 > > > Repository: mesos > > > Description > --- > > A push-based gauge differs from a pull-based gauge in that the > client is responsible for pushing the latest value into the gauge > whenever it changes. This can be challenging in some cases as it > requires the client to have a good handle on when the gauge value > changes (rather than just computing the current value when asked). > > It is highly recommended to use push-based gauges if possible as > they provide significant performance benefits over pull-based > gauges. Pull-based gauge suffer from delays getting processed on > the event queue of a `Process`, as well as incur computation cost > on the `Process` each time the metrics are collected. Push-based > gauges, on the other hand, incur no cost to the owning `Process` > when metrics are collected, and instead incur a trivial cost when > the `Process` pushes new values in. > > > Diffs > - > > 3rdparty/libprocess/include/Makefile.am > cd2c3bc62df8de5b50ec2fa830b3e2634ba11e28 > 3rdparty/libprocess/include/process/metrics/push_gauge.hpp PRE-CREATION > > > Diff: https://reviews.apache.org/r/66828/diff/1/ > > > Testing > --- > > Test added in the subsequent patch. > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 66828: Introduced a push-based gauge metric.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66828/#review202194 --- 3rdparty/libprocess/include/process/metrics/push_gauge.hpp Lines 70-75 (patched) <https://reviews.apache.org/r/66828/#comment283932> Hmm, is there a possible race condition here? Consider two threads calling this with `v=2` and `v=3`, and initial value as `0`, and the order would be: T1: `fetch_add(2)`, `prev` in T1 == 0; T2: `fetch_add(3)`, `prev` in T2 == 2; T2: `push(2 + 3)`, metric value is 5; T1: `push(0 + 2)`, metric value is 2. Metric value and cached value in gauge would be out of sync due to the race condition? - Zhitao Li On April 26, 2018, 2:54 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66828/ > --- > > (Updated April 26, 2018, 2:54 p.m.) > > > Review request for mesos, Gaston Kleiman, Gilbert Song, Greg Mann, James > Peach, and Vinod Kone. > > > Bugs: MESOS-8851 > https://issues.apache.org/jira/browse/MESOS-8851 > > > Repository: mesos > > > Description > --- > > A push-based gauge differs from a pull-based gauge in that the > client is responsible for pushing the latest value into the gauge > whenever it changes. This can be challenging in some cases as it > requires the client to have a good handle on when the gauge value > changes (rather than just computing the current value when asked). > > It is highly recommended to use push-based gauges if possible as > they provide significant performance benefits over pull-based > gauges. Pull-based gauge suffer from delays getting processed on > the event queue of a `Process`, as well as incur computation cost > on the `Process` each time the metrics are collected. Push-based > gauges, on the other hand, incur no cost to the owning `Process` > when metrics are collected, and instead incur a trivial cost when > the `Process` pushes new values in. > > > Diffs > - > > 3rdparty/libprocess/include/Makefile.am > cd2c3bc62df8de5b50ec2fa830b3e2634ba11e28 > 3rdparty/libprocess/include/process/metrics/push_gauge.hpp PRE-CREATION > > > Diff: https://reviews.apache.org/r/66828/diff/1/ > > > Testing > --- > > Test added in the subsequent patch. > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 54986: Added pull gauges for slave message queue.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/54986/ --- (Updated May 1, 2018, 8:54 a.m.) Review request for mesos, Eric Chung, Jason Lai, and James Peach. Changes --- Rebase and use PullGauge. Summary (updated) - Added pull gauges for slave message queue. Bugs: MESOS-6831 https://issues.apache.org/jira/browse/MESOS-6831 Repository: mesos Description (updated) --- Added pull gauges for slave message queue. Diffs (updated) - src/slave/metrics.hpp 80b47222e05997b23041cc51037824a4cef3ec02 src/slave/metrics.cpp a1d01edea108ddecb92030fbd7f2e25f0fd143f0 src/slave/slave.hpp fb911efe4985c7bf3b340f29a5d884d476307705 src/tests/slave_tests.cpp ae82438a4f2f79820fe6d476e25b4915aa5f212c Diff: https://reviews.apache.org/r/54986/diff/3/ Changes: https://reviews.apache.org/r/54986/diff/2-3/ Testing --- make check. Thanks, Zhitao Li
Re: Review Request 53267: Added a PullGauge to track active subscribers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53267/ --- (Updated May 1, 2018, 8:54 a.m.) Review request for mesos, Anand Mazumdar, Jason Lai, and James Peach. Changes --- Rebase and use PullGauge. Summary (updated) - Added a PullGauge to track active subscribers. Bugs: MESOS-6499 https://issues.apache.org/jira/browse/MESOS-6499 Repository: mesos Description (updated) --- Added a PullGauge to track active subscribers. Diffs (updated) - src/master/master.hpp c30cf082d9ddf2e5dd97e0d887d6d07d8e03cbd4 src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/master/metrics.hpp ec76dbcd1d1fa5349d62ce73fb9603e1986a776b src/master/metrics.cpp e46ead79f3f29e285426f9d061337077f453aa45 Diff: https://reviews.apache.org/r/53267/diff/4/ Changes: https://reviews.apache.org/r/53267/diff/3-4/ Testing --- Ran this on master with a client keeps subscribing and disconnecting. Observes that gauge gets updated and new log line is printed. Thanks, Zhitao Li
Re: Review Request 66875: Added an hourly timer for docker store pull latency.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66875/ --- (Updated April 30, 2018, 4:32 p.m.) Review request for mesos, Gilbert Song, Jason Lai, and Jie Yu. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description --- This patch added pull latency tracking for docker store, which can tell us both latency distribution of pull as well as number of pulls. Diffs - src/slave/containerizer/mesos/provisioner/docker/store.cpp 8b3f07f5027cb90d4b4ed401960494709d3eda5f Diff: https://reviews.apache.org/r/66875/diff/1/ Testing (updated) --- Ran agent in command line and trigger several launches through `mesos-execute`, observed following metrics from agent endpoint: ``` "containerizer/mesos/docker_store/pull_ms": 4084.254208, "containerizer/mesos/docker_store/pull_ms/count": 2, "containerizer/mesos/docker_store/pull_ms/max": 4084.254208, "containerizer/mesos/docker_store/pull_ms/min": 3525.044736, "containerizer/mesos/docker_store/pull_ms/p50": 3804.649472, "containerizer/mesos/docker_store/pull_ms/p90": 4028.3332608, "containerizer/mesos/docker_store/pull_ms/p95": 4056.2937344, "containerizer/mesos/docker_store/pull_ms/p99": 4078.66211328, "containerizer/mesos/docker_store/pull_ms/p999": 4083.694998528, "containerizer/mesos/docker_store/pull_ms/p": 4084.1982870528, ``` Thanks, Zhitao Li
Re: Review Request 66292: Validated that all tasks in the same group have same max_duration.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66292/ --- (Updated April 30, 2018, 3:51 p.m.) Review request for mesos, Jason Lai and James Peach. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- Validated that all tasks in the same group have same max_duration. Diffs - src/master/validation.cpp 9254cd22e67d9aca53135701575fb0857ff6f7d6 Diff: https://reviews.apache.org/r/66292/diff/1/ Testing --- Thanks, Zhitao Li
Re: Review Request 66875: Added an hourly timer for docker store pull latency.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66875/ --- (Updated April 30, 2018, 3:36 p.m.) Review request for mesos, Gilbert Song, Jason Lai, and Jie Yu. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description --- This patch added pull latency tracking for docker store, which can tell us both latency distribution of pull as well as number of pulls. Diffs - src/slave/containerizer/mesos/provisioner/docker/store.cpp 8b3f07f5027cb90d4b4ed401960494709d3eda5f Diff: https://reviews.apache.org/r/66875/diff/1/ Testing --- Thanks, Zhitao Li
Review Request 66875: Added an hourly timer for docker store pull latency.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66875/ --- Review request for mesos, Gilbert Song, Jason Lai, and Jie Yu. Repository: mesos Description --- This patch added pull latency tracking for docker store, which can tell us both latency distribution of pull as well as number of pulls. Diffs - src/slave/containerizer/mesos/provisioner/docker/store.cpp 8b3f07f5027cb90d4b4ed401960494709d3eda5f Diff: https://reviews.apache.org/r/66875/diff/1/ Testing --- Thanks, Zhitao Li
Re: Review Request 53105: Added an hourly timer for `slave/docker_containerizer/pull`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53105/ --- (Updated April 30, 2018, 3:35 p.m.) Review request for mesos, Xiaojian Huang, Gilbert Song, Jason Lai, and Jie Yu. Changes --- Rebase to revive the patch. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description --- Added an hourly timer for `slave/docker_containerizer/pull`. Diffs (updated) - src/slave/containerizer/docker.hpp 6bdf4c7f60384473927de437a061fafe94170557 src/slave/containerizer/docker.cpp a4c9c10e91e75f406329cbb2086f39b549cbeed0 Diff: https://reviews.apache.org/r/53105/diff/5/ Changes: https://reviews.apache.org/r/53105/diff/4-5/ Testing --- Ran `mesos-execute` with a docker image and docker containerizer, observed the following metrics: ``` curl -s localhost:5051/metrics/snapshot | jq . | grep pull "containerizer/docker/pull_ms/p999": 26209.36173824, "containerizer/docker/pull_ms/p90": 21036.405248, "containerizer/docker/pull_ms/p": 26256.388615424, "containerizer/docker/pull_ms/p50": 135.570944, "containerizer/docker/pull_ms/max": 26261.613824, "containerizer/docker/pull_ms/p95": 23649.009536, "containerizer/docker/pull_ms/min": 103.215872, "containerizer/docker/pull_ms/p99": 25739.0929664, "containerizer/docker/pull_ms": 26261.613824, "containerizer/docker/pull_ms/count": 3 ``` Thanks, Zhitao Li
Re: Review Request 53330: Tracked layers and pull latency in docker store.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53330/ --- (Updated April 30, 2018, 2:23 p.m.) Review request for mesos, Xiaojian Huang, Gilbert Song, Jason Lai, and Jie Yu. Bugs: MESOS-6451 https://issues.apache.org/jira/browse/MESOS-6451 Repository: mesos Description --- This patch added several metrics for docker store: - a counter for number of layers pulled - a gauge for total number of layers - a timer for pull latency distribution in the last hour Diffs - src/slave/containerizer/mesos/provisioner/docker/paths.hpp 232c027f8f96da0cb30b957bce4607d3695050d2 src/slave/containerizer/mesos/provisioner/docker/paths.cpp cd684b33eb308ce1eeb4539a5b2d51985d835db7 src/slave/containerizer/mesos/provisioner/docker/store.cpp 68ce265a5d0a61f8d9ed55dd14c630dcf893a7d2 Diff: https://reviews.apache.org/r/53330/diff/4/ Testing --- Use mesos-execute to pull several different images, and verified following counters in agent's metrics/snapshot: ``` curl localhost:5051/metrics/snapshot | jq . | grep containerizer/mesos/docker_pull "containerizer/mesos/docker_store/layers_pulled": 47, "containerizer/mesos/docker_store/pull_ms/p50": 7080.760064, "containerizer/mesos/docker_store/pull_ms/p90": 11653.6390912, "containerizer/mesos/docker_store/pull_ms/p": 12528.3066648832, "containerizer/mesos/docker_store/pull_ms": 4550.814976, "containerizer/mesos/docker_store/pull_ms/count": 4, "containerizer/mesos/docker_store/pull_ms/max": 12529.182208, "containerizer/mesos/docker_store/pull_ms/p999": 12520.426776832, "containerizer/mesos/docker_store/pull_ms/min": 3167.337984, "containerizer/mesos/docker_store/pull_ms/p95": 12091.4106496, "containerizer/mesos/docker_store/pull_ms/p99": 12441.62789632, "containerizer/mesos/docker_store/layers": 47, ``` Thanks, Zhitao Li
Re: Review Request 54987: Updated `docs/monitoring/md` for new agent event queue metrics.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/54987/ --- (Updated April 30, 2018, 2:09 p.m.) Review request for mesos, Anand Mazumdar, Xiaojian Huang, and Jason Lai. Changes --- Rebase. Bugs: MESOS-6831 https://issues.apache.org/jira/browse/MESOS-6831 Repository: mesos Description --- Updated `docs/monitoring/md` for new agent event queue metrics. Diffs (updated) - docs/monitoring.md c20b99eb32a50830ea6f66b5bdbfcabe45fc6ecb Diff: https://reviews.apache.org/r/54987/diff/2/ Changes: https://reviews.apache.org/r/54987/diff/1-2/ Testing --- Thanks, Zhitao Li
Re: Review Request 54986: Added metric for slave message queue.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/54986/ --- (Updated April 30, 2018, 2:08 p.m.) Review request for mesos, Eric Chung, Jason Lai, and James Peach. Changes --- Rebase. Bugs: MESOS-6831 https://issues.apache.org/jira/browse/MESOS-6831 Repository: mesos Description --- Added metric for slave message queue. Diffs (updated) - src/slave/metrics.hpp b771c4b54efb6a2ea841e99c16e098b33302ca89 src/slave/metrics.cpp 44294af7cac29e462330d94217eed9f767972f0e src/slave/slave.hpp c35996bdd564075b9b8f1467e13e24a6531d022e src/tests/slave_tests.cpp ae82438a4f2f79820fe6d476e25b4915aa5f212c Diff: https://reviews.apache.org/r/54986/diff/2/ Changes: https://reviews.apache.org/r/54986/diff/1-2/ Testing --- make check. Thanks, Zhitao Li
Re: Review Request 54986: Added metric for slave message queue.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/54986/ --- (Updated April 30, 2018, 2:01 p.m.) Review request for mesos, Anand Mazumdar, Xiaojian Huang, and Jason Lai. Bugs: MESOS-6831 https://issues.apache.org/jira/browse/MESOS-6831 Repository: mesos Description --- Added metric for slave message queue. Diffs - src/slave/metrics.hpp 3fc933ca65690d6fad63156398ad9c2c53789296 src/slave/metrics.cpp 0eb2b59ed67e14e73b29d7592c239441df0008d5 src/slave/slave.hpp be72ba93d4c6ca3e69e3f76f87aa7ddebdbf9cad src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 Diff: https://reviews.apache.org/r/54986/diff/1/ Testing --- make check. Thanks, Zhitao Li
Review Request 66871: Added documentation about new gauge on `master/subscribers_active`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66871/ --- Review request for mesos, Jason Lai and James Peach. Bugs: MESOS-6499 https://issues.apache.org/jira/browse/MESOS-6499 Repository: mesos Description --- Added documentation about new gauge on `master/subscribers_active`. Diffs - docs/monitoring.md c20b99eb32a50830ea6f66b5bdbfcabe45fc6ecb Diff: https://reviews.apache.org/r/66871/diff/1/ Testing --- https://github.com/zhitaoli/mesos/blob/zhitao/public/active_subscribers/docs/monitoring.md#http-api Thanks, Zhitao Li
Re: Review Request 53267: Added a gauge to track active subscribers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53267/ --- (Updated April 30, 2018, 1:53 p.m.) Review request for mesos, Anand Mazumdar, Jason Lai, and James Peach. Changes --- Rebase and move. Bugs: MESOS-6499 https://issues.apache.org/jira/browse/MESOS-6499 Repository: mesos Description --- Added a gauge to track active subscribers. Diffs (updated) - src/master/master.hpp a7cadd9c97add92a0276bf64e0da941cae63fd7c src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/master/metrics.hpp 5414c4790f4606f910b0de81cbb08d398a0f3745 src/master/metrics.cpp 4cc96a1be532d6c7333a9fcdf8739d7640180777 Diff: https://reviews.apache.org/r/53267/diff/3/ Changes: https://reviews.apache.org/r/53267/diff/2-3/ Testing --- Ran this on master with a client keeps subscribing and disconnecting. Observes that gauge gets updated and new log line is printed. Thanks, Zhitao Li
Re: Review Request 66050: Implemented grow and shrink of persistent volumes.
> On April 24, 2018, 9:05 p.m., Chun-Hung Hsiao wrote: > > src/master/validation.cpp > > Lines 2332 (patched) > > <https://reviews.apache.org/r/66050/diff/13/?file=2011231#file2011231line2332> > > > > Can you add tests for this function in > > `src/tests/master_validation_tests.cpp`? To unblock this patch, it can be > > done in a follow-up one. Done in https://reviews.apache.org/r/66858/. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66050/#review201887 ------- On April 27, 2018, 12:24 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66050/ > --- > > (Updated April 27, 2018, 12:24 p.m.) > > > Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. > > > Bugs: MESOS-4965 > https://issues.apache.org/jira/browse/MESOS-4965 > > > Repository: mesos > > > Description > --- > > The new offer operations are implemented as speculative operations, but > we will use validation to make them non-speculative on API level so that > we can transition later without a breaking change. > > > Diffs > - > > src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b > src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 > src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 > src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf > src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d > > > Diff: https://reviews.apache.org/r/66050/diff/14/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66858/ --- Review request for mesos, Chun-Hung Hsiao and Greg Mann. Bugs: MESOS-4945 https://issues.apache.org/jira/browse/MESOS-4945 Repository: mesos Description --- Added tests for validation of `GrowVolume` and `ShrinkVolume`. Diffs - src/tests/master_validation_tests.cpp a5229610a81dfac9d358597135b63ee51de86659 Diff: https://reviews.apache.org/r/66858/diff/1/ Testing --- Thanks, Zhitao Li
Re: Review Request 66531: Added new authorization for `ResizeVolume`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66531/ --- (Updated April 27, 2018, 1:14 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Bugs: MESOS-8748 https://issues.apache.org/jira/browse/MESOS-8748 Repository: mesos Description --- The new authorization action is modelled after `CreateVolume`, and will be shared by both `GROW_VOLUME` and `SHRINK_VOLUME` operations and corresponding operator APIs. Diffs (updated) - include/mesos/authorizer/acls.proto 8ef33210644e7d2924b402a3158b1b38c1fdb464 include/mesos/authorizer/authorizer.proto 1508c01130013d74ed2b2574a2428f38e3d2c064 src/authorizer/local/authorizer.cpp c098ba9ded1b29a2e079cf09ab80b61f6fa4af05 src/master/master.hpp a7cadd9c97add92a0276bf64e0da941cae63fd7c src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 Diff: https://reviews.apache.org/r/66531/diff/5/ Changes: https://reviews.apache.org/r/66531/diff/4-5/ Testing --- Manually created ACL and verified that untrusted principals will not allow to grow/shrink volumes. Also created unit test in next patch. Thanks, Zhitao Li
Re: Review Request 66051: Implemented operator API to grow and shrink persistent volume.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66051/ --- (Updated April 27, 2018, 1:09 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Updated after validation change. Bugs: MESOS-8747 https://issues.apache.org/jira/browse/MESOS-8747 Repository: mesos Description --- These operator APIs is implemented as speculative for now, but we plan to convert them to non-speculative in the future. Diffs (updated) - src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad src/master/master.hpp a7cadd9c97add92a0276bf64e0da941cae63fd7c src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d Diff: https://reviews.apache.org/r/66051/diff/15/ Changes: https://reviews.apache.org/r/66051/diff/14-15/ Testing --- Thanks, Zhitao Li
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/ --- (Updated April 27, 2018, 1:08 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Changes --- use `Clock::settle()`. Bugs: MESOS-8748 https://issues.apache.org/jira/browse/MESOS-8748 Repository: mesos Description --- Added test for authorization actions for `RESIZE_VOLUME`. Diffs (updated) - src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a Diff: https://reviews.apache.org/r/66532/diff/6/ Changes: https://reviews.apache.org/r/66532/diff/5-6/ Testing --- Thanks, Zhitao Li
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> On April 25, 2018, 9:09 a.m., Greg Mann wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 451 (patched) > > <https://reviews.apache.org/r/66220/diff/6/?file=2006752#file2006752line451> > > > > It's not 100% clear to me if these tests actually verify that the > > agent's resource checkpointing has occurred? IIUC, settling the clock after > > awaiting on the `ApplyOperationMessage` which applies the GROW or SHRINK > > may be sufficient to verify this? They don't. I've rewrittent the comment. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/#review201843 ----------- On April 27, 2018, 1:04 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66220/ > --- > > (Updated April 27, 2018, 1:04 p.m.) > > > Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. > > > Bugs: MESOS-4965 > https://issues.apache.org/jira/browse/MESOS-4965 > > > Repository: mesos > > > Description > --- > > Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations. > > > Diffs > - > > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66220/diff/8/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> On April 25, 2018, 4:59 p.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 501-505 (patched) > > <https://reviews.apache.org/r/66220/diff/7/?file=2011389#file2011389line501> > > > > ``` > > // Disk spaces will be merged if fixture parameter is `NONE`. > > Bytes totalBytes = GetParam() == NONE ? MegaBytes(4096) : > > MegaBytes(2048); > > ``` > > > > TBH I'm not a fan of conditioning on the fixture parameters. If a new > > fixture is created, this can be resolved by having only one disk in > > `getSlaveResources()` instead. But given the fact that we also do similar > > conditional branches in other tests, I can live with this. I'll leave this as a TODO. > On April 25, 2018, 4:59 p.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 823 (patched) > > <https://reviews.apache.org/r/66220/diff/7/?file=2011389#file2011389line823> > > > > This is not required. New test has offers so this will be necessary. > On April 25, 2018, 4:59 p.m., Chun-Hung Hsiao wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 853-855 (patched) > > <https://reviews.apache.org/r/66220/diff/7/?file=2011389#file2011389line853> > > > > This is essentially testing the same property as the last > > `EXPECT_FALSE` so we don't need it. New test dropped this. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/#review201969 --- On April 27, 2018, 1:04 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66220/ > --- > > (Updated April 27, 2018, 1:04 p.m.) > > > Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. > > > Bugs: MESOS-4965 > https://issues.apache.org/jira/browse/MESOS-4965 > > > Repository: mesos > > > Description > --- > > Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations. > > > Diffs > - > > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66220/diff/8/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> On April 13, 2018, 9:43 a.m., Greg Mann wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 455-459 (patched) > > <https://reviews.apache.org/r/66220/diff/4/?file=1996762#file1996762line455> > > > > Is this enforced somewhere in validation code? Can we check for > > expected behavior when a GROW/SHRINK operation is submitted for a MOUNT > > volume, rather than simply returning? > > Zhitao Li wrote: > I added validation in r/66050 so we drop shrink volume operation on > MOUNT. There is no logical path for triggering `GROW` on `MOUNT` disk type so > I'm not writing validation. > > For testing, we can either keep the current structure and check that > `GROW`/`SHRINK` do not work on `MOUNT` (operation will be dropped), or take > Chun's suggestion to not test them for `MOUNT`. Please let me know. > > Zhitao Li wrote: > I think this will be the only test case which needs to skip `MOUNT`. It > also makes some sense because there is no logical starting point for a > framework to even construct a `GROW_VOLUME` message for `MOUNT`. > > I still feel that the handling here could be better. we can discuss on > next patch revision. > > Chun-Hung Hsiao wrote: > I'm sorry for missing this reply. If we want to just skip `MOUNT`, I > prefer having a new fixture as a subclass of the original and make it only > parameterized against `NONE` and `PATH`; otherwise, how about doing proper > checks in `offersAfterGrow` based on the parameter: > ``` > if (GetParam() == MOUNT) { > EXPECT_EQ( > allocatedResources(Resources(volume), frameworkInfo.roles(0)), > Resources(offer.resources()).persistentVolumes()); > } else { > EXPECT_TRUE(os::exists(volumePath)); > EXPECT_SOME_EQ("abc", os::read(filePath)); > > EXPECT_EQ( > allocatedResources(Resources(grownVolume), frameworkInfo.roles(0)), > Resources(offer.resources()).persistentVolumes()); > > EXPECT_FALSE( > Resources(offer.resources()).contains( > allocatedResources(addition, frameworkInfo.roles(0; > } > ``` > My reason is that when seeing > `ROOT_MountDiskResource/PersistentVolumeTest.GrowVolume/0` passes really > means something is tested. I'm going to leave this as a TODO for future improvement. > On April 13, 2018, 9:43 a.m., Greg Mann wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 541-542 (patched) > > <https://reviews.apache.org/r/66220/diff/4/?file=1996762#file1996762line541> > > > > Is this `Future` necessary? Since the task consumes the volume, it may > > be sufficient to await on the task status updates? > > Zhitao Li wrote: > Yes this is necessary if we do not launch task in this test. we need to > reliably know that `Allocator::updateAllocation` is called from master, and > this message happens before that, so this `Future` ensures allocator has > properly processed all operation conversions and next offer will contain the > host and updated resources. > > Chun-Hung Hsiao wrote: > I see. Here you're relying on the fact that when the > `ApplyOperationMessage` is processed on the slave, the master has already > speculatively applied the operation called `updatedAllocation`. However > `updateAllocation` is asynchronous so there is still a potential race between > `ApplyOperationMessage` and the actual allocation update. > > I suggest instead of waiting for the internal message, let's do a > `Clock::settle();` before `Clock::advance(...)`. I just tested it locally. > > Ditto for all the following awaits for `ApplyOperationMessage`. `Clock::settle()` is not sufficient. What we need here is not confirming on agent processing, but confirming master has sent this to agent, because we reliably know that master has also called `allocatior::updateAllocation()` (which happens before this call). Otherwise allocator may not have correct view of resources. Unfornately master -> allocator communication cannot be intecepted by `FUTURE...` so this is the closest thing I can find. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/#review201109 --- On April 27, 2018, 1:04 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66220/ > --- > > (Upd
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> On April 25, 2018, 9:09 a.m., Greg Mann wrote: > > src/tests/persistent_volume_tests.cpp > > Lines 585-586 (patched) > > <https://reviews.apache.org/r/66220/diff/6/?file=2006752#file2006752line585> > > > > IIUC, the `Clock::settle()` here is ensuring that this test actually > > verifies that the agent's resources have been checkpointed? > > > > If so, I would recommend: > > 1) Add a comment explaining the reason for the settle. > > 2) Place the settle after the await. I think the two orderings are > > equivalent, but I think it's more readable if the settle is after the await? > > Chun-Hung Hsiao wrote: > The await can be removed. See my comments above. This is necessary to make sure that master -> allocator communication for `allocator::updateAllocation()` has happened. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/#review201843 ------- On April 27, 2018, 1:04 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66220/ > --- > > (Updated April 27, 2018, 1:04 p.m.) > > > Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. > > > Bugs: MESOS-4965 > https://issues.apache.org/jira/browse/MESOS-4965 > > > Repository: mesos > > > Description > --- > > Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations. > > > Diffs > - > > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66220/diff/8/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66220/ --- (Updated April 27, 2018, 1:04 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations. Diffs (updated) - src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a Diff: https://reviews.apache.org/r/66220/diff/8/ Changes: https://reviews.apache.org/r/66220/diff/7-8/ Testing --- Thanks, Zhitao Li
Re: Review Request 66050: Implemented grow and shrink of persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66050/ --- (Updated April 27, 2018, 12:24 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Review comments with better validation on chaining. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- The new offer operations are implemented as speculative operations, but we will use validation to make them non-speculative on API level so that we can transition later without a breaking change. Diffs (updated) - src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d Diff: https://reviews.apache.org/r/66050/diff/14/ Changes: https://reviews.apache.org/r/66050/diff/13-14/ Testing --- Thanks, Zhitao Li
Re: Review Request 66049: Added offer operation to grow and shrink persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66049/ --- (Updated April 27, 2018, 12:23 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Fix build when grpc is enabled. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- Added offer operation to grow and shrink persistent volumes. Diffs (updated) - include/mesos/mesos.proto 5bc4a80ad5278d20feced73dce755519515de505 include/mesos/v1/mesos.proto 5a4e733250831fa5fe86544aeb98dbc0a4f5afa6 src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/resource_provider/storage/provider.cpp 8ca2d3a98858940e6e027becefb53c2f00b4ae43 src/tests/mesos.hpp 756a52189ea3336b203ea1d606e2ba17762d57aa src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a src/tests/reservation_tests.cpp 5570df2e0b208512d0a0a3079a180e1acfe08f3d Diff: https://reviews.apache.org/r/66049/diff/12/ Changes: https://reviews.apache.org/r/66049/diff/11-12/ Testing --- Thanks, Zhitao Li
Re: Review Request 66284: Tested `max_completion_time` support in docker executor.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66284/ --- (Updated April 27, 2018, 9:09 a.m.) Review request for mesos, Jason Lai and James Peach. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- Tested `max_completion_time` support in docker executor. Diffs (updated) - src/tests/containerizer/docker_containerizer_tests.cpp 5e3dfdb537f52a85a82c6e8195e70a389b76aadf Diff: https://reviews.apache.org/r/66284/diff/4/ Changes: https://reviews.apache.org/r/66284/diff/3-4/ Testing --- Thanks, Zhitao Li
Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.
> On April 23, 2018, 3:43 p.m., Greg Mann wrote: > > src/master/master.cpp > > Line 4512 (original), 4515-4527 (patched) > > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515> > > > > I was just wondering - perhaps this code belongs in the validation > > function for the ACCEPT call? Since this is something that we can easily > > confirm synchronously in the top-level scheduler API handler and return > > 400, rather than calling into the accept handlers. > > Zhitao Li wrote: > I'm fine in either case as long as it provides better consistency and > less code duplication (which I think is a problem). I guess the reason there > is a split is due to that AuthZ is async right now. > > I actualy wonder what happens if we move the slave connectivity checks to > `Master::accept()`: since master to agent connection through network can be > broken at any moment, I do not see checking that in `_accept()` is 100% safe > anyway, and `send()` calls to agent need to handle failures anyway. > > Alternatively, we can consider moving authorization checks as early as we > can, and condense all other validations after AuthZ returned (assuming AuthZ > do not depends on these validations). > > Zhitao Li wrote: > (replied too fast) I feel both work can reduce the code duplicate, and > makes it less ambiguous for future development. > > Chun-Hung Hsiao wrote: > IMO in general it's better to have validations before authN/authZ, > because validations is usually more light-weight than authN/authZ. Also, > validations should be synchronous, meaning that if the call is invalid, we > will get rid of it faster from the memory (rather than waiting for the > authN/authZ to finish). > > Chun-Hung Hsiao wrote: > But in this case since this is temporary, I'm fine with putting the logic > here. > > Chun-Hung Hsiao wrote: > Question: Are these operations usable with this restriction? It seems to > me that with this restriction, the framework will need to maintain one more > extra state: 1. reserve enough CPUs, mem and free disk (otherwise it is > possible that the framework cannot get enough CPUs and mem after growing the > volume); 2. grow the persistent volume; 3. use the grown volume. If we > implement a proper validation right now, 1 and 2 can be combined, and there > will be no need to rewrite the framework in the future. Can you explain again > what the concern of implementing the correct validation logic again? > > Zhitao Li wrote: > Re: order of valiation vs AuthZ: I agree they should be consistency, but > we already have splits between `accept()` and `_accept()` and framework could > see operations dropped either before or after AuthZ too. Given that 1) AuthZ > will be async by nature and 2) certain states could change between `accept()` > and `_accept()` (offer validaty, agent connectivity), I feel that only check > AuthZ on `accept()` but delay any actual validation to `_accept()` results in > most consistent behavior. Indeed this could increase calls to authorizor but > I don't know whether that can be quantified. > > Re: usability of framework: I generally believe that framework always > needs to model each state to handle corner cases even if it uses chained > operations (because that can have partial success). That's also why I hope to > eventually not support chained operations at all. In the final form of the > code, whatever "correct validation logic" we write now will not be necessary, > and we can implicitly rely on `offeredResources` containment checks chained > operations (because `converted` resources will not be added back > immediately). Implementing this right now would require us to add extra > variables like `observedOfferedResources` and `actualOfferedResources` and > check them on each `case` of operations, which is not be as clean or as local > as current patch (which also easier to back out). > > Chun-Hung Hsiao wrote: > I'm not sure if I get it. Instead of applying the conversion on > `_offeredResources`, we could just do > ``` > _offeredResources -= consumed; > ``` > Wouldn't this solve the issue? Am I missing anything? > > Zhitao Li wrote: > This code block > (https://github.com/apache/mesos/blob/bd688e4cf9f6f35c9d75aad50b99e1fdeb8104da/src/master/master.cpp#L5502) > requires `_offeredResources` properly tracks what allocator needs to > recover. I believe we still need to handle `converted` part of `grow` and > `shrink`, otherwise allocator could see inconsitent views. > > Chun-Hung Hsiao wrote: > Thanks for poin
Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.
> On April 23, 2018, 3:43 p.m., Greg Mann wrote: > > src/master/master.cpp > > Line 4512 (original), 4515-4527 (patched) > > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515> > > > > I was just wondering - perhaps this code belongs in the validation > > function for the ACCEPT call? Since this is something that we can easily > > confirm synchronously in the top-level scheduler API handler and return > > 400, rather than calling into the accept handlers. > > Zhitao Li wrote: > I'm fine in either case as long as it provides better consistency and > less code duplication (which I think is a problem). I guess the reason there > is a split is due to that AuthZ is async right now. > > I actualy wonder what happens if we move the slave connectivity checks to > `Master::accept()`: since master to agent connection through network can be > broken at any moment, I do not see checking that in `_accept()` is 100% safe > anyway, and `send()` calls to agent need to handle failures anyway. > > Alternatively, we can consider moving authorization checks as early as we > can, and condense all other validations after AuthZ returned (assuming AuthZ > do not depends on these validations). > > Zhitao Li wrote: > (replied too fast) I feel both work can reduce the code duplicate, and > makes it less ambiguous for future development. > > Chun-Hung Hsiao wrote: > IMO in general it's better to have validations before authN/authZ, > because validations is usually more light-weight than authN/authZ. Also, > validations should be synchronous, meaning that if the call is invalid, we > will get rid of it faster from the memory (rather than waiting for the > authN/authZ to finish). > > Chun-Hung Hsiao wrote: > But in this case since this is temporary, I'm fine with putting the logic > here. > > Chun-Hung Hsiao wrote: > Question: Are these operations usable with this restriction? It seems to > me that with this restriction, the framework will need to maintain one more > extra state: 1. reserve enough CPUs, mem and free disk (otherwise it is > possible that the framework cannot get enough CPUs and mem after growing the > volume); 2. grow the persistent volume; 3. use the grown volume. If we > implement a proper validation right now, 1 and 2 can be combined, and there > will be no need to rewrite the framework in the future. Can you explain again > what the concern of implementing the correct validation logic again? > > Zhitao Li wrote: > Re: order of valiation vs AuthZ: I agree they should be consistency, but > we already have splits between `accept()` and `_accept()` and framework could > see operations dropped either before or after AuthZ too. Given that 1) AuthZ > will be async by nature and 2) certain states could change between `accept()` > and `_accept()` (offer validaty, agent connectivity), I feel that only check > AuthZ on `accept()` but delay any actual validation to `_accept()` results in > most consistent behavior. Indeed this could increase calls to authorizor but > I don't know whether that can be quantified. > > Re: usability of framework: I generally believe that framework always > needs to model each state to handle corner cases even if it uses chained > operations (because that can have partial success). That's also why I hope to > eventually not support chained operations at all. In the final form of the > code, whatever "correct validation logic" we write now will not be necessary, > and we can implicitly rely on `offeredResources` containment checks chained > operations (because `converted` resources will not be added back > immediately). Implementing this right now would require us to add extra > variables like `observedOfferedResources` and `actualOfferedResources` and > check them on each `case` of operations, which is not be as clean or as local > as current patch (which also easier to back out). > > Chun-Hung Hsiao wrote: > I'm not sure if I get it. Instead of applying the conversion on > `_offeredResources`, we could just do > ``` > _offeredResources -= consumed; > ``` > Wouldn't this solve the issue? Am I missing anything? This code block (https://github.com/apache/mesos/blob/bd688e4cf9f6f35c9d75aad50b99e1fdeb8104da/src/master/master.cpp#L5502) requires `_offeredResources` properly tracks what allocator needs to recover. I believe we still need to handle `converted` part of `grow` and `shrink`, otherwise allocator could see inconsitent views. - Zhitao --- This is an automatically generated
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
> On April 17, 2018, 4:24 p.m., Greg Mann wrote: > > src/tests/authorization_tests.cpp > > Line 1979 (original), 1979 (patched) > > <https://reviews.apache.org/r/66532/diff/2/?file=1996775#file1996775line1979> > > > > Could you also add an end-to-end test of authorization for these > > operations? > > Zhitao Li wrote: > Sure. Do you have an example or existing test to add this? > > Greg Mann wrote: > There are some similar tests in both 'persistent_volume_tests.cpp' as > well as 'master_authorization_tests.cpp'. > > Tests which do something very similar in the persistent volume tests are > 'PersistentVolumeTest.BadACLDropCreateAndDestroy' and > 'PersistentVolumeTest.GoodACLCreateThenDestroy'. It would be good to verify > both the successful and failed authorization cases. I've added some test cases. It seems that there will be some level of code duplication and I'm not sure whether we want to follow the ones for `...CreateAndDrop`, or Chun's idea of smaller tests verifying one thing at a time. We can definitely iterate on these. Marking this comment as fixed. - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/#review201362 --- On April 24, 2018, 11:16 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66532/ > --- > > (Updated April 24, 2018, 11:16 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-8748 > https://issues.apache.org/jira/browse/MESOS-8748 > > > Repository: mesos > > > Description > --- > > Added test for authorization actions for `RESIZE_VOLUME`. > > > Diffs > - > > src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c > src/tests/persistent_volume_tests.cpp > 4edf781711d9efdb994114aeb6289b6af750b87a > > > Diff: https://reviews.apache.org/r/66532/diff/5/ > > > Testing > --- > > > Thanks, > > Zhitao Li > >
Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66532/ --- (Updated April 24, 2018, 11:16 p.m.) Review request for mesos, Chun-Hung Hsiao and Greg Mann. Changes --- Add initial tests for authorization of new operations. Bugs: MESOS-8748 https://issues.apache.org/jira/browse/MESOS-8748 Repository: mesos Description --- Added test for authorization actions for `RESIZE_VOLUME`. Diffs (updated) - src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c src/tests/persistent_volume_tests.cpp 4edf781711d9efdb994114aeb6289b6af750b87a Diff: https://reviews.apache.org/r/66532/diff/5/ Changes: https://reviews.apache.org/r/66532/diff/4-5/ Testing --- Thanks, Zhitao Li
Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.
> On April 23, 2018, 3:43 p.m., Greg Mann wrote: > > src/master/master.cpp > > Line 4512 (original), 4515-4527 (patched) > > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515> > > > > I was just wondering - perhaps this code belongs in the validation > > function for the ACCEPT call? Since this is something that we can easily > > confirm synchronously in the top-level scheduler API handler and return > > 400, rather than calling into the accept handlers. > > Zhitao Li wrote: > I'm fine in either case as long as it provides better consistency and > less code duplication (which I think is a problem). I guess the reason there > is a split is due to that AuthZ is async right now. > > I actualy wonder what happens if we move the slave connectivity checks to > `Master::accept()`: since master to agent connection through network can be > broken at any moment, I do not see checking that in `_accept()` is 100% safe > anyway, and `send()` calls to agent need to handle failures anyway. > > Alternatively, we can consider moving authorization checks as early as we > can, and condense all other validations after AuthZ returned (assuming AuthZ > do not depends on these validations). > > Zhitao Li wrote: > (replied too fast) I feel both work can reduce the code duplicate, and > makes it less ambiguous for future development. > > Chun-Hung Hsiao wrote: > IMO in general it's better to have validations before authN/authZ, > because validations is usually more light-weight than authN/authZ. Also, > validations should be synchronous, meaning that if the call is invalid, we > will get rid of it faster from the memory (rather than waiting for the > authN/authZ to finish). > > Chun-Hung Hsiao wrote: > But in this case since this is temporary, I'm fine with putting the logic > here. > > Chun-Hung Hsiao wrote: > Question: Are these operations usable with this restriction? It seems to > me that with this restriction, the framework will need to maintain one more > extra state: 1. reserve enough CPUs, mem and free disk (otherwise it is > possible that the framework cannot get enough CPUs and mem after growing the > volume); 2. grow the persistent volume; 3. use the grown volume. If we > implement a proper validation right now, 1 and 2 can be combined, and there > will be no need to rewrite the framework in the future. Can you explain again > what the concern of implementing the correct validation logic again? Re: order of valiation vs AuthZ: I agree they should be consistency, but we already have splits between `accept()` and `_accept()` and framework could see operations dropped either before or after AuthZ too. Given that 1) AuthZ will be async by nature and 2) certain states could change between `accept()` and `_accept()` (offer validaty, agent connectivity), I feel that only check AuthZ on `accept()` but delay any actual validation to `_accept()` results in most consistent behavior. Indeed this could increase calls to authorizor but I don't know whether that can be quantified. Re: usability of framework: I generally believe that framework always needs to model each state to handle corner cases even if it uses chained operations (because that can have partial success). That's also why I hope to eventually not support chained operations at all. In the final form of the code, whatever "correct validation logic" we write now will not be necessary, and we can implicitly rely on `offeredResources` containment checks chained operations (because `converted` resources will not be added back immediately). Implementing this right now would require us to add extra variables like `observedOfferedResources` and `actualOfferedResources` and check them on each `case` of operations, which is not be as clean or as local as current patch (which also easier to back out). - Zhitao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66568/#review201785 --- On April 23, 2018, 1:07 p.m., Zhitao Li wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66568/ > --- > > (Updated April 23, 2018, 1:07 p.m.) > > > Review request for mesos, Chun-Hung Hsiao and Greg Mann. > > > Bugs: MESOS-4965 > https://issues.apache.org/jira/browse/MESOS-4965 > > > Repository: mesos > > > Description > --- > > These two operations are intended to be
Re: Review Request 66050: Implemented grow and shrink of persistent volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66050/ --- (Updated April 24, 2018, 2:16 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Rebase. Bugs: MESOS-4965 https://issues.apache.org/jira/browse/MESOS-4965 Repository: mesos Description --- The new offer operations are implemented as speculative operations, but we will use validation to make them non-speculative on API level so that we can transition later without a breaking change. Diffs (updated) - src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d Diff: https://reviews.apache.org/r/66050/diff/13/ Changes: https://reviews.apache.org/r/66050/diff/12-13/ Testing --- Thanks, Zhitao Li
Re: Review Request 66733: Added a new `RESIZE_VOLUME` agent capability.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66733/ --- (Updated April 24, 2018, 2:16 p.m.) Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann. Changes --- Review comments. Bugs: MESOS-4945 https://issues.apache.org/jira/browse/MESOS-4945 Repository: mesos Description --- This will be used as a feature flag to gate the new volume resize feature. This feature will be turn on by default once released. Diffs (updated) - include/mesos/mesos.proto 5bc4a80ad5278d20feced73dce755519515de505 include/mesos/v1/mesos.proto 5a4e733250831fa5fe86544aeb98dbc0a4f5afa6 src/common/protobuf_utils.hpp ae060f3a7946ac5862faeca15330e9642f934137 src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b src/slave/constants.cpp 51de71bc45ff4fb862690efd9428b25d92055391 src/slave/flags.cpp 0ee4f65b14b578c8f8de60d42dff0ee438cdd8a8 src/tests/master_tests.cpp d5ce52c0eb1ba333d1b3dd35518545c13d828ce6 src/tests/slave_tests.cpp ae82438a4f2f79820fe6d476e25b4915aa5f212c Diff: https://reviews.apache.org/r/66733/diff/2/ Changes: https://reviews.apache.org/r/66733/diff/1-2/ Testing --- Thanks, Zhitao Li
Re: Review Request 66291: Added support to `max_completion_time` in default executor.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66291/ --- (Updated April 23, 2018, 4:56 p.m.) Review request for mesos, Jason Lai and James Peach. Changes --- Use `killedByCompletionTimeout` to indicate kill reason, and refactor `kill()` to reuse a grace period. Bugs: MESOS-8725 https://issues.apache.org/jira/browse/MESOS-8725 Repository: mesos Description --- When a task group has multiple tasks: - each task can have its own `max_completion_time`, or not have one; - if a task succeeds before its `max_completion_time`, all other tasks will keep running; - if a task fails, all other tasks in the same group will fail (as before); - if a task does not succeed before its `max_completion_time`, it will fail with `TASK_FAILED` and reason `REASON_MAX_COMPLETION_TIME_REACHED`, while other tasks will be killed without above reason. Diffs (updated) - src/launcher/default_executor.cpp ea0d425bb60e970f209f411632e1d486c279259b Diff: https://reviews.apache.org/r/66291/diff/3/ Changes: https://reviews.apache.org/r/66291/diff/2-3/ Testing --- Thanks, Zhitao Li