[jira] [Comment Edited] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

Vinod Kone (JIRA) Mon, 08 Apr 2019 10:29:02 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812550#comment-16812550
 ]


Vinod Kone edited comment on MESOS-6285 at 4/8/19 5:27 PM:
-----------------------------------------------------------

We already limit the number of completed tasks per executor (200, not 
configurable), completed executors per framework (150, configurable) and max 
frameworks (50, not configurable) in memory. I don't think there's much value 
in storing metadata information about more than these 
tasks/executors/frameworks on the disk? If yes, we need to figure out how to GC 
a task/executor/framework once it goes out of the in-memory circular buffers / 
bounded hashmaps holding these.


was (Author: vinodkone):
We already limit the number of completed tasks per executor (200, not 
configurable) and completed executors per framework (150, configurable) in 
memory. I don't think there's much value in storing metadata information about 
more than these tasks/executors on the disk? If yes, we need to figure out how 
to GC a task/executor once it goes out of the in-memory circular buffers / 
bounded hashmaps holding these.

> Agents may OOM during recovery if there are too many tasks or executors
> -----------------------------------------------------------------------
>
>                 Key: MESOS-6285
>                 URL: https://issues.apache.org/jira/browse/MESOS-6285
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere
>
> On an test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> un-recoverable.  
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which leads to slow recovery; and often results in the agent being OOM-killed 
> before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp 
> b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>    // Helper to launch a task using an offer.
>    void launch(const Offer& offer)
>    {
> -    int taskId = tasksLaunched++;
> -    ++metrics.tasks_launched;
> -
> -    TaskInfo task;
> -    task.set_name("Task " + stringify(taskId));
> -    task.mutable_task_id()->set_value(stringify(taskId));
> -    task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -    task.mutable_resources()->CopyFrom(taskResources);
> -    task.mutable_executor()->CopyFrom(executor);
> -
>      Call call;
>      call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>      Offer::Operation* operation = accept->add_operations();
>      operation->set_type(Offer::Operation::LAUNCH);
>  
> -    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +    // Launch as many tasks as possible in the given offer.
> +    Resources remaining = Resources(offer.resources()).flatten();
> +    while (remaining.contains(taskResources)) {
> +      int taskId = tasksLaunched++;
> +      ++metrics.tasks_launched;
> +
> +      TaskInfo task;
> +      task.set_name("Task " + stringify(taskId));
> +      task.mutable_task_id()->set_value(stringify(taskId));
> +      task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +      task.mutable_resources()->CopyFrom(taskResources);
> +      task.mutable_executor()->CopyFrom(executor);
> +
> +      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +      remaining -= taskResources;
> +    }
>  
>      mesos->send(call);
>    }
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  On a 1 CPU, 1 GB agent 
> + this patch, it should take about 10 minutes to build up sufficient task 
> launches.
> 3) Restart the agent and watch it flail during recovery.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

Reply via email to