[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17383897#comment-17383897 ]
Andreas Peters commented on MESOS-6285:
---------------------------------------
Is this still an issue or can we close it? :)
> Agents may OOM during recovery if there are too many tasks or executors
> -----------------------------------------------------------------------
>
> Key: MESOS-6285
> URL: https://issues.apache.org/jira/browse/MESOS-6285
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.0.1
> Reporter: Joseph Wu
> Priority: Critical
> Labels: foundations, mesosphere
>
> On a test cluster, we encountered a degenerate case where running the
> example {{long-lived-framework}} for over a week would render the agent
> unrecoverable.
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and
> launches a single task on that executor every time it receives an offer from
> that agent. Over a week's worth of time, the framework manages to launch
> some 400k tasks (short sleeps) on one executor. During runtime, this is not
> problematic, as each completed task is quickly rotated out of the agent's
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory,
> which slows recovery and often results in the agent being OOM-killed
> before it finishes recovering.
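The contrast described above, bounded memory at runtime versus loading everything at recovery, can be sketched roughly as follows. This is only an illustration, not the agent's actual code: `TaskRecord`, `CompletedTasks`, `recoverAll`, and the history limit are hypothetical names and values chosen for the example.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for one checkpointed task; not a Mesos type.
struct TaskRecord {
  std::string id;
};

// Runtime behaviour as described in the report: completed tasks are rotated
// out of memory, so at most `limit` records are held at any time.
// The fixed-size history is an assumption made for this sketch.
class CompletedTasks {
public:
  explicit CompletedTasks(std::size_t limit) : limit_(limit) {}

  void add(const TaskRecord& task) {
    if (limit_ == 0) {
      return;  // Nothing is retained when the history is disabled.
    }
    if (tasks_.size() == limit_) {
      tasks_.pop_front();  // Evict the oldest record.
    }
    tasks_.push_back(task);
  }

  std::size_t size() const { return tasks_.size(); }

private:
  std::size_t limit_;
  std::deque<TaskRecord> tasks_;
};

// Recovery behaviour as described in the report: every checkpointed task is
// materialized in memory at once, so memory grows with the total task count.
std::vector<TaskRecord> recoverAll(std::size_t checkpointedTasks) {
  std::vector<TaskRecord> tasks;
  tasks.reserve(checkpointedTasks);
  for (std::size_t i = 0; i < checkpointedTasks; ++i) {
    tasks.push_back(TaskRecord{std::to_string(i)});
  }
  return tasks;
}
```

With some 400k tasks, the runtime history in this sketch stays at `limit` entries, while `recoverAll` holds all 400k records at the same time.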
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>    // Helper to launch a task using an offer.
>    void launch(const Offer& offer)
>    {
> -    int taskId = tasksLaunched++;
> -    ++metrics.tasks_launched;
> -
> -    TaskInfo task;
> -    task.set_name("Task " + stringify(taskId));
> -    task.mutable_task_id()->set_value(stringify(taskId));
> -    task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -    task.mutable_resources()->CopyFrom(taskResources);
> -    task.mutable_executor()->CopyFrom(executor);
> -
>      Call call;
>      call.set_type(Call::ACCEPT);
>
> @@ -380,7 +370,23 @@ private:
>      Offer::Operation* operation = accept->add_operations();
>      operation->set_type(Offer::Operation::LAUNCH);
>
> -    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +    // Launch as many tasks as possible in the given offer.
> +    Resources remaining = Resources(offer.resources()).flatten();
> +    while (remaining.contains(taskResources)) {
> +      int taskId = tasksLaunched++;
> +      ++metrics.tasks_launched;
> +
> +      TaskInfo task;
> +      task.set_name("Task " + stringify(taskId));
> +      task.mutable_task_id()->set_value(stringify(taskId));
> +      task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +      task.mutable_resources()->CopyFrom(taskResources);
> +      task.mutable_executor()->CopyFrom(executor);
> +
> +      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +      remaining -= taskResources;
> +    }
>
>      mesos->send(call);
>    }
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}. With this patch
> applied, a 1 CPU, 1 GB agent should take about 10 minutes to build up
> sufficient task launches.
> 3) Restart the agent and watch it flail during recovery.
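
The loop added by the patch in step 1 keeps adding tasks to a single ACCEPT call until the offer can no longer hold another one. A minimal standalone sketch of that packing logic is shown below. `SimpleResources` is a hypothetical, integer-only stand-in for Mesos's `Resources` class, and the per-task sizes used with it are assumed values, not the ones `long-lived-framework` actually uses.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for Mesos's Resources class.
// Integer units (milli-CPUs, megabytes) avoid floating-point drift.
struct SimpleResources {
  long milliCpus;
  long memMb;

  // True if `other` fits entirely within these resources.
  bool contains(const SimpleResources& other) const {
    return milliCpus >= other.milliCpus && memMb >= other.memMb;
  }

  SimpleResources& operator-=(const SimpleResources& other) {
    milliCpus -= other.milliCpus;
    memMb -= other.memMb;
    return *this;
  }
};

// Counts how many tasks the same while-loop shape as in the patch would
// launch from one offer.
long tasksPerOffer(SimpleResources offer, const SimpleResources& task) {
  // Sizes must be non-negative, and a task that consumes nothing would
  // make the loop below spin forever.
  assert(task.milliCpus >= 0 && task.memMb >= 0);
  assert(task.milliCpus > 0 || task.memMb > 0);

  long launched = 0;
  while (offer.contains(task)) {
    ++launched;
    offer -= task;
  }
  return launched;
}
```

For example, under the assumed task size of 1 milli-CPU and 1 MB, an offer of 1000 milli-CPUs and 1024 MB yields 1000 tasks per offer, limited by CPU. That is the multiplier that lets the repro accumulate task launches in minutes instead of days.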
--
This message was sent by Atlassian Jira
(v8.3.4#803005)