[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16812635#comment-16812635
 ] 

Vinod Kone commented on MESOS-6285:
-----------------------------------

Note that `state.cpp` currently reads the executor state from disk for *all* 
completed executors. We could improve this by reading completed executor 
information only until we reach the per-framework limit on completed 
executors. The same applies to completed tasks and completed frameworks.
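
A minimal sketch of that bounded read, not the actual `state.cpp` code. The 
types `ExecutorEntry` and `ExecutorState`, the function `recoverExecutors`, 
and the `load` callback are hypothetical names used only for illustration. 
The sketch assumes that entries are ordered newest first and that whether an 
executor is completed can be determined cheaply, without reading its full 
checkpointed state:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one checkpointed executor found on disk.
// Assumption: `completed` is known without loading the full state.
struct ExecutorEntry {
  std::string id;
  bool completed;
};

// Hypothetical stand-in for the fully loaded (expensive) executor state.
struct ExecutorState {
  std::string id;
  bool completed;
};

// Loads every live executor, but at most `maxCompleted` completed ones.
// `entries` is assumed to be ordered newest first, so the completed
// executors that are kept are the most recent ones. `load` represents
// the expensive read of an executor's state from disk.
std::vector<ExecutorState> recoverExecutors(
    const std::vector<ExecutorEntry>& entries,
    std::size_t maxCompleted,
    const std::function<ExecutorState(const ExecutorEntry&)>& load)
{
  assert(static_cast<bool>(load));

  std::vector<ExecutorState> recovered;
  std::size_t completedRead = 0;

  for (const ExecutorEntry& entry : entries) {
    if (entry.completed) {
      if (completedRead >= maxCompleted) {
        // Over the limit: skip without touching the disk.
        continue;
      }
      ++completedRead;
    }

    recovered.push_back(load(entry));
  }

  return recovered;
}
```

With a limit of 2 and five entries of which three are completed, `load` is 
called four times: twice for the live executors and twice for the newest 
completed ones. The same pattern would apply to completed tasks and 
completed frameworks, each with its own limit.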

> Agents may OOM during recovery if there are too many tasks or executors
> -----------------------------------------------------------------------
>
>                 Key: MESOS-6285
>                 URL: https://issues.apache.org/jira/browse/MESOS-6285
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Joseph Wu
>            Priority: Critical
>              Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> un-recoverable.  
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which leads to slow recovery and often results in the agent being 
> OOM-killed before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>    // Helper to launch a task using an offer.
>    void launch(const Offer& offer)
>    {
> -    int taskId = tasksLaunched++;
> -    ++metrics.tasks_launched;
> -
> -    TaskInfo task;
> -    task.set_name("Task " + stringify(taskId));
> -    task.mutable_task_id()->set_value(stringify(taskId));
> -    task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -    task.mutable_resources()->CopyFrom(taskResources);
> -    task.mutable_executor()->CopyFrom(executor);
> -
>      Call call;
>      call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>      Offer::Operation* operation = accept->add_operations();
>      operation->set_type(Offer::Operation::LAUNCH);
>  
> -    operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +    // Launch as many tasks as possible in the given offer.
> +    Resources remaining = Resources(offer.resources()).flatten();
> +    while (remaining.contains(taskResources)) {
> +      int taskId = tasksLaunched++;
> +      ++metrics.tasks_launched;
> +
> +      TaskInfo task;
> +      task.set_name("Task " + stringify(taskId));
> +      task.mutable_task_id()->set_value(stringify(taskId));
> +      task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +      task.mutable_resources()->CopyFrom(taskResources);
> +      task.mutable_executor()->CopyFrom(executor);
> +
> +      operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +      remaining -= taskResources;
> +    }
>  
>      mesos->send(call);
>    }
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  On a 1 CPU, 1 GB agent 
> with this patch applied, it should take about 10 minutes to build up 
> sufficient task launches.
> 3) Restart the agent and watch it flail during recovery.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
