[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354944#comment-14354944 ]

Bernd Mathiske edited comment on MESOS-391 at 3/10/15 2:23 PM:
---------------------------------------------------------------

1) The slave process must not perform the deletions itself, because it might 
become unresponsive for an unforeseeable amount of time that way.

2) "Waiting" can be implemented by the libprocess "future.then()" pattern. Be 
careful to schedule the deletions on another process, but the ".then()" part 
back on the slave process.
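
As a minimal illustration of that pattern (the names Worker, deletePaths, and 
retryRunTask are made up for this sketch, not actual slave code): dispatch the 
potentially slow work to a separate libprocess process, and defer the 
continuation back onto the slave process so its state is only touched from its 
own execution context.

{code}
// Sketch only: the slow deletions run on a separate 'worker' process,
// and the continuation is deferred back onto the slave process.
using process::Future;
using process::defer;
using process::dispatch;

Future<Nothing> deletions =
  dispatch(worker, &Worker::deletePaths, paths); // runs on the worker process

deletions.then(defer(self(), &Slave::retryRunTask, frameworkInfo, task));
{code}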

Here is a pseudocode draft that we could put right in the middle of 
Slave::runTask(). Some renaming and filling in of details is still required. We 
replace all code from this line to the end of runTask():

{code}
if (executor == NULL) {
  executor = framework->launchExecutor(executorInfo, task);
}
{code}

with this:

{code}
if (executor == NULL) {
  runTask2(...); // better naming needed
} else {
  runTask3(...); // better naming needed
}
{code}

General idea: runTask3() contains all the code that follows the above. 
runTask2() looks roughly as follows:

{code}
void Slave::runTask2()
{
  subdirCount = ...lstat(...)...; // more detail later
  if (subdirCount < ...LINK_MAX...) {
    executor = framework->launchExecutor(executorInfo, task);
    runTask3(...); // what was originally the rest of runTask()
  } else {
    paths = ...executorInfo...; // find out which paths to delete
    gc->scheduleImmediately(paths) // returns a future that signals completion of the deletions
      .then(defer(self(),
                  &Slave::runTask2, // try again if there are still too many links
                  ...));
  }
}
{code}

This is still rough, but you get the idea, I hope. :-)
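
For the "...lstat(...)..." placeholder, a minimal sketch of obtaining the 
subdirectory count via POSIX lstat(2) could look like the following (plain C++, 
not the actual Mesos helpers; error handling is simplified):

{code}
#include <sys/stat.h>
#include <string>

// Sketch only: on typical filesystems each subdirectory adds one hard link
// to its parent, so st_nlink approximates the number of executor directories
// (plus 2 for "." and the entry in the parent directory).
static long directoryLinkCount(const std::string& directory)
{
  struct stat s;
  if (::lstat(directory.c_str(), &s) < 0) {
    return -1; // real code would propagate errno
  }
  return static_cast<long>(s.st_nlink);
}
{code}

The threshold to compare against could come from pathconf(path, _PC_LINK_MAX) 
rather than a hard-coded LINK_MAX, since the limit is filesystem-dependent.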



> Slave GarbageCollector needs to also take into account the number of links, 
> when determining removal time.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-391
>                 URL: https://issues.apache.org/jira/browse/MESOS-391
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Bernd Mathiske
>              Labels: twitter
>
> The slave garbage collector does not take into account the number of links 
> present, which means that if we create a lot of executor directories (up to 
> LINK_MAX), we won't necessarily GC.
> As a result of this, the slave crashes:
> F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed 
> to create executor directory 
> '/var/lib/mesos/slaves/201303090208-1937777162-5050-38880-267/frameworks/201103282247-0000000019-0000/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e':
>  Too many links
> *** Check failure stack trace: ***
>     @     0x7f9320f82f9d  google::LogMessage::Fail()
>     @     0x7f9320f88c07  google::LogMessage::SendToLog()
>     @     0x7f9320f8484c  google::LogMessage::Flush()
>     @     0x7f9320f84ab6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f9320c70312  _CheckSome::~_CheckSome()
>     @     0x7f9320c9dd5c  
> mesos::internal::slave::paths::createExecutorDirectory()
>     @     0x7f9320c9e60d  mesos::internal::slave::Framework::createExecutor()
>     @     0x7f9320c7a7f7  mesos::internal::slave::Slave::runTask()
>     @     0x7f9320c9cb43  ProtobufProcess<>::handler4<>()
>     @     0x7f9320c8678b  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f9320c9d1ab  ProtobufProcess<>::visit()
>     @     0x7f9320e4c774  process::MessageEvent::visit()
>     @     0x7f9320e40a1d  process::ProcessManager::resume()
>     @     0x7f9320e41268  process::schedule()
>     @     0x7f932055973d  start_thread
>     @     0x7f931ef3df6d  clone
> The fix here is to take into account the number of links (st_nlinks), when 
> determining whether we need to GC.



