-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54183/
-----------------------------------------------------------

(Updated Dec. 12, 2016, 7:06 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Rebase, tweak comment.


Bugs: MESOS-6619
    https://issues.apache.org/jira/browse/MESOS-6619


Repository: mesos


Description
-------

Before partition-awareness, when an agent failed health checks, the
master removed the agent from the registry, marked all of its tasks
TASK_LOST, and moved them to the `completedTasks` list in the master's
memory. Although "lost" tasks might still be running, partitioned agents
would only be allowed to re-register if the master failed over, in which
case the `completedTasks` list would be emptied.

When partition-awareness was introduced, we initially followed the same
scheme, with the only difference that partition-aware tasks are marked
TASK_UNREACHABLE, not TASK_LOST.

This scheme has a few shortcomings. First, partition-aware tasks might
resume running when the partitioned agent re-registers, so listing them
as completed is misleading. Second, we re-added non-partition-aware
tasks when the agent re-registered but then marked them completed when
the framework was shut down, resulting in two entries in
`completedTasks`.

This commit introduces a separate bounded map, `unreachableTasks`. These
tasks are reported separately via the HTTP endpoints, because they have
different semantics (unlike completed tasks, unreachable tasks can
resume running). The size of this map is limited by a new master flag,
`--max_unreachable_tasks_per_framework`.

This commit also changes the master to omit re-adding
non-partition-aware tasks on re-registering agents (unless the master
has failed over): those tasks will shortly be shut down anyway.

Finally, this commit fixes a minor bug in the previous code: it
neglected to shut down non-partition-aware frameworks running on pre-1.0
Mesos agents that re-register with the master after a network partition.
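
For readers unfamiliar with the idea of a bounded map, the following is a minimal sketch of one way such a container could behave: it holds at most a fixed number of entries and evicts the oldest one when the bound is reached. The class name `BoundedMap` and its methods are hypothetical and chosen for illustration; this is not the data structure used in the patch, and the eviction policy shown (oldest insertion first) is an assumption.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical illustration only; not the container used in the patch.
// Holds at most `capacity` entries and evicts the oldest inserted entry
// once the bound is reached.
template <typename K, typename V>
class BoundedMap
{
public:
  explicit BoundedMap(std::size_t capacity) : capacity_(capacity) {}

  // Inserts or updates `key`. Updating an existing key keeps its
  // original position in the eviction order.
  void set(const K& key, const V& value)
  {
    auto it = index_.find(key);
    if (it != index_.end()) {
      it->second->second = value;
      return;
    }

    if (capacity_ == 0) {
      return; // A zero bound retains nothing.
    }

    if (entries_.size() == capacity_) {
      // Evict the oldest entry to make room for the new one.
      index_.erase(entries_.front().first);
      entries_.pop_front();
    }

    entries_.emplace_back(key, value);
    index_[key] = std::prev(entries_.end());
  }

  bool contains(const K& key) const { return index_.count(key) > 0; }

  // Removes `key` if present (e.g. when an unreachable task resumes
  // running). Returns true if an entry was removed.
  bool erase(const K& key)
  {
    auto it = index_.find(key);
    if (it == index_.end()) {
      return false;
    }
    entries_.erase(it->second);
    index_.erase(it);
    return true;
  }

  std::size_t size() const { return entries_.size(); }

private:
  using Entries = std::list<std::pair<K, V>>;

  std::size_t capacity_;
  Entries entries_; // Oldest entry at the front.
  std::unordered_map<K, typename Entries::iterator> index_;
};
```

With a bound of 2, inserting three task IDs in this sketch leaves only the two most recent ones; erasing an ID models a task leaving the unreachable set when its agent re-registers.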

Diffs (updated)
-----

  docs/configuration.md efe3e9bd9d203a7ba44adf4ead24f14b8b577637
  include/mesos/master/master.proto 0d33251a9016ab99a1d70f15637d55f41caefb63
  include/mesos/v1/master/master.proto 09a82af88303a2d971da7c56a7075d7005932363
  src/master/constants.hpp 5dd0667f62d2c0617cc0d5aed8cc005bd8344c88
  src/master/flags.hpp 6a17b763dc76daa10073394f416b049e97a44238
  src/master/flags.cpp e5edf3333d0a0c529a10dc602ef9a88a0ec60c69
  src/master/http.cpp d52806dcf8e4d64ebb98e191a01408c0fcae17ac
  src/master/master.hpp c304e69af0b7a720ea8277088cabc0675eec8b57
  src/master/master.cpp 8c1c7f94102a2f40fbdffaa36f2d1c15e78a906d
  src/tests/partition_tests.cpp 00cc815529dc4d303db638680eacb8f55713d1a1

Diff: https://reviews.apache.org/r/54183/diff/


Testing
-------

`make check`


Thanks,

Neil Conway