[
https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350228#comment-14350228
]
Bernd Mathiske commented on MESOS-391:
--------------------------------------
Here is an attempt to answer your questions from my understanding of the code
base.
1. It seems reasonable to assume that more recent leftovers have a higher
probability of being "interesting" for inspection than older ones.
flags.gc_delay is simply an estimated time after which this probability is
deemed low enough that one may safely delete. However, if for whatever reason
a directory has been modified, that constitutes more recent activity of
potential interest, and we restart the directory's aging from that moment.
Notice that the slave code touches every directory just before scheduling GC
for it; it seems to me this is to ensure that the GC clock for the given path
starts right then and there, not sooner.
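To make the timing concrete, here is a minimal sketch of that pattern. This
is not the actual slave code: utime() and the one-week constant merely stand
in for whatever the slave and flags.gc_delay actually use.

    // Sketch only: reset a directory's mtime so its GC clock starts
    // "now", then compute the removal time from that mtime.
    #include <ctime>
    #include <string>
    #include <utime.h>   // utime()

    // Hypothetical stand-in for flags.gc_delay (one week).
    const time_t GC_DELAY_SECS = 7 * 24 * 60 * 60;

    // "Touch" the directory so aging restarts from this moment.
    // Passing NULL sets both atime and mtime to the current time.
    bool touch(const std::string& path) {
      return ::utime(path.c_str(), NULL) == 0;
    }

    // The scheduled removal time is simply "last modification + delay".
    time_t removalTime(time_t mtime) {
      return mtime + GC_DELAY_SECS;
    }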
2. Yes and no. In principle, it is preferable to remove directories in strict
order from "least interesting" to "most interesting", which is "least
recently used" to "most recently used". But I expect that we can break this
"rule" to prevent calamity. When we do, we should still preserve the original
removal order within a given parent directory as much as possible.
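For example, even a forced early sweep can honor that ordering by sorting the
candidates oldest-first before removing any of them. A minimal sketch, with
hypothetical names:

    // Sketch only: order candidate directories least-recently-used
    // first, so a forced early sweep still removes the "least
    // interesting" entries before the "most interesting" ones.
    #include <algorithm>
    #include <ctime>
    #include <string>
    #include <utility>
    #include <vector>

    // (mtime, path) pairs for the candidate directories.
    typedef std::pair<time_t, std::string> Entry;

    std::vector<Entry> removalOrder(std::vector<Entry> entries) {
      // std::pair compares its first member first, so this sorts by
      // mtime ascending: oldest (least recently used) first.
      std::sort(entries.begin(), entries.end());
      return entries;
    }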
3. I don't think we can be absolutely certain about this. This case just has
not been tickled by the program that led to this ticket. I suggest that it will
be easy to cover this case as well.
We should also look at each of the other directories that slaves create and
form an opinion on each, just in case. There are these levels of directories
(their nesting is sketched after the list):
1. slave
2. framework
3. executor
4. executor run (i.e. task run)
and each comes as "work" (sandbox) and "meta".
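Concretely, the sandbox path in the crash report below shows how these levels
nest; the "meta" tree mirrors the same layout under the slave's meta
directory (placeholders, not literal names):

    <work_dir>/slaves/<slave_id>/frameworks/<framework_id>/executors/<executor_id>/runs/<run_id>
    <meta_dir>/slaves/<slave_id>/frameworks/<framework_id>/executors/<executor_id>/runs/<run_id>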
> Slave GarbageCollector needs to also take into account the number of links,
> when determining removal time.
> ----------------------------------------------------------------------------------------------------------
>
> Key: MESOS-391
> URL: https://issues.apache.org/jira/browse/MESOS-391
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Assignee: Ritwik Yadav
> Labels: twitter
>
> The slave garbage collector does not take into account the number of links
> present, which means that if we create a lot of executor directories (up to
> LINK_MAX), we won't necessarily GC.
> As a result of this, the slave crashes:
> F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-1937777162-5050-38880-267/frameworks/201103282247-0000000019-0000/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links
> *** Check failure stack trace: ***
> @ 0x7f9320f82f9d google::LogMessage::Fail()
> @ 0x7f9320f88c07 google::LogMessage::SendToLog()
> @ 0x7f9320f8484c google::LogMessage::Flush()
> @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f9320c70312 _CheckSome::~_CheckSome()
> @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory()
> @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor()
> @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask()
> @ 0x7f9320c9cb43 ProtobufProcess<>::handler4<>()
> @ 0x7f9320c8678b std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7f9320c9d1ab ProtobufProcess<>::visit()
> @ 0x7f9320e4c774 process::MessageEvent::visit()
> @ 0x7f9320e40a1d process::ProcessManager::resume()
> @ 0x7f9320e41268 process::schedule()
> @ 0x7f932055973d start_thread
> @ 0x7f931ef3df6d clone
> The fix here is to take into account the number of links (st_nlink) when
> determining whether we need to GC.
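For reference, a minimal sketch of such a check, assuming POSIX stat() and
pathconf(); the helper name and the 90% headroom threshold are hypothetical:

    // Sketch only: every subdirectory adds a hard link ("..") to its
    // parent, so once the parent's st_nlink approaches the filesystem's
    // per-directory link limit, mkdir() fails with EMLINK ("Too many
    // links"). GC should be forced early when this check fires.
    #include <string>
    #include <sys/stat.h>  // stat(), st_nlink
    #include <unistd.h>    // pathconf(), _PC_LINK_MAX

    bool nearLinkLimit(const std::string& dir, double headroom = 0.9) {
      struct stat s;
      if (::stat(dir.c_str(), &s) != 0) {
        return false;  // Cannot tell; let the caller decide.
      }

      // pathconf() reports the limit for this particular filesystem.
      long max = ::pathconf(dir.c_str(), _PC_LINK_MAX);
      if (max <= 0) {
        return false;  // Limit unknown or effectively unlimited.
      }

      return s.st_nlink >= max * headroom;
    }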
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)