Yan Xu created MESOS-4975:
-----------------------------

             Summary: mesos::internal::master::Slave::tasks can grow unboundedly
                 Key: MESOS-4975
                 URL: https://issues.apache.org/jira/browse/MESOS-4975
             Project: Mesos
          Issue Type: Bug
          Components: master
            Reporter: Yan Xu


In a Mesos cluster we observed the following:

{noformat}
$ jq '.orphan_tasks | length' state.json
1369
$ jq '.unregistered_frameworks | length' state.json
20162
{noformat}

Aside from {{unregistered_frameworks}} here being "the list of frameworkIDs for 
each orphan task" (described in MESOS-4973), the discrepancy between the two 
values above is surprising.

I think the problem is that we do this in the master:

From [source|https://github.com/apache/mesos/blob/e376d3aa0074710278224ccd17afd51971820dfb/src/master/master.cpp#L2212]:
{code}
    foreachvalue (Slave* slave, slaves.registered) {
      foreachvalue (Task* task, slave->tasks[framework->id()]) {
        framework->addTask(task);
      }
      foreachvalue (const ExecutorInfo& executor,
                    slave->executors[framework->id()]) {
        framework->addExecutor(slave->id, executor);
      }
    }
{code}

Here {{operator[]}} is used whenever a framework subscribes, regardless of 
whether this agent has tasks for the framework or not. When the key is absent, 
{{operator[]}} inserts a default-constructed value for it.

If the agent has no tasks for this framework, the resulting \{frameworkID: 
empty hashmap\} entry stays in the map indefinitely. If frameworks are 
ephemeral and new ones keep coming in, the map grows without bound.
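
The growth can be reproduced outside Mesos. The following is a minimal sketch, assuming {{std::unordered_map}} as a stand-in for the master's {{hashmap}}; the names {{TaskMap}}, {{AgentTasks}}, {{countTasksUnguarded}} and {{simulateSubscriptions}} are hypothetical and do not exist in the Mesos code base.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the master's bookkeeping, not the Mesos types.
using TaskMap = std::unordered_map<std::string, int>;         // taskId -> task
using AgentTasks = std::unordered_map<std::string, TaskMap>;  // frameworkId -> tasks

// Mimics the subscribe path: iterates the agent's tasks for one framework
// through operator[], which default-constructs an entry if the key is missing.
std::size_t countTasksUnguarded(AgentTasks& tasks,
                                const std::string& frameworkId) {
  std::size_t n = 0;
  for (const auto& entry : tasks[frameworkId]) {
    (void)entry;  // The task itself is not needed for this illustration.
    ++n;
  }
  return n;
}

// Simulates `count` distinct ephemeral frameworks subscribing, none of which
// has tasks on this agent, and returns the number of map entries left behind.
std::size_t simulateSubscriptions(AgentTasks& tasks, int count) {
  for (int i = 0; i < count; ++i) {
    countTasksUnguarded(tasks, "framework-" + std::to_string(i));
  }
  return tasks.size();
}
```

In this sketch, calling {{simulateSubscriptions}} on an empty map leaves one empty entry per framework ID, even though no task was ever added.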

We should check {{tasks.contains(frameworkId)}} before using {{operator[]}}. The same applies to the {{executors}} lookup in the snippet above.
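
A guarded lookup could look roughly like the sketch below. It again assumes {{std::unordered_map}} in place of the master's {{hashmap}} and uses {{find()}} to do the existence check and the access in one step; {{TaskMap}}, {{AgentTasks}} and {{countTasksGuarded}} are hypothetical names for illustration, not the actual patch.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the master's bookkeeping, not the Mesos types.
using TaskMap = std::unordered_map<std::string, int>;         // taskId -> task
using AgentTasks = std::unordered_map<std::string, TaskMap>;  // frameworkId -> tasks

// Looks up the agent's tasks for one framework without inserting anything.
// Taking the map by const reference makes accidental insertion impossible,
// because operator[] is not available on a const map.
std::size_t countTasksGuarded(const AgentTasks& tasks,
                              const std::string& frameworkId) {
  auto it = tasks.find(frameworkId);
  if (it == tasks.end()) {
    return 0;  // No tasks for this framework on this agent; map is untouched.
  }
  return it->second.size();
}
```

With this shape, a subscription from a framework that has no tasks on the agent leaves the map size unchanged.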



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
