[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.

Ritwik Yadav (JIRA) Thu, 05 Mar 2015 05:30:07 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14348733#comment-14348733
 ]


Ritwik Yadav commented on MESOS-391:
------------------------------------

The problem at hand is that the number of executor directories under a 
framework can exceed the limit on the number of subdirectories a single 
directory may have. The limit is imposed by the underlying filesystem and can 
be obtained by calling ‘pathconf’ with ‘_PC_LINK_MAX’ for any given path [1].
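
As a minimal sketch of that lookup (the helper name ‘linkMax’ is hypothetical 
and not part of Mesos), the limit for a path could be queried like this:

```cpp
#include <unistd.h>  // pathconf, _PC_LINK_MAX

#include <string>

// Hypothetical helper, not part of Mesos: returns the maximum link count
// the filesystem allows for `path`, or -1 if it cannot be determined.
// Note that pathconf() returns -1 both on error (errno is set) and when
// the filesystem defines no limit (errno is left unchanged).
long linkMax(const std::string& path)
{
  return ::pathconf(path.c_str(), _PC_LINK_MAX);
}
```

A caller that needs to tell "no limit" apart from a failure would set errno to 
0 before the call and inspect it afterwards.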

The directory layout used by Mesos slaves is defined in paths.hpp [2] and 
described in detail by the comments there.

Tackling this problem requires an overview of the garbage collection process. 
Directories are garbage collected upon removal / recovery of frameworks and 
executors. A slave schedules a directory for garbage collection 
‘flags.gc_delay’ after its last modification time.

Apart from this, the slave runs a separate thread that periodically checks the 
slave's disk utilization. It immediately garbage collects all the directories 
scheduled for garbage collection within the next ‘t’ seconds, where ‘t’ is a 
linear function (with positive slope) of the slave's disk usage.
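
To make the relationship concrete, here is one possible linear form. The 
function name, the ‘headroom’ parameter and the coefficients are assumptions 
made for illustration; the comment above only says that ‘t’ grows linearly 
with disk usage.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch: returns 't' in seconds. Every directory scheduled
// for removal within the next 't' seconds would be collected immediately.
//   gcDelaySeconds: the configured GC delay, in seconds.
//   diskUsage:      fraction of the disk in use, in [0, 1].
//   headroom:       assumed safety margin, as a fraction of the disk.
double pruneWindowSeconds(
    double gcDelaySeconds,
    double diskUsage,
    double headroom)
{
  assert(gcDelaySeconds >= 0.0);

  // Linear in diskUsage with positive slope, clamped to [0, 1] so that
  // 't' never exceeds the full GC delay.
  double fraction = std::min(1.0, std::max(0.0, diskUsage + headroom));
  return gcDelaySeconds * fraction;
}
```

Under these assumptions an empty disk gives t = 0 (nothing is collected 
early), while a full disk gives t = gc_delay (everything scheduled is 
collected at once).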

The goal is to garbage collect a scheduled directory sooner when it has a 
high number of sub-directories, or, recursively, when one of its 
sub-directories does.

The number of sub-directories in any directory can be derived from the link 
count (‘st_nlink’) returned by the ‘lstat’ function [3].
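
A rough sketch of the recursive check follows. The helper ‘maxLinkCount’ is 
hypothetical, and it assumes a filesystem on which a directory's ‘st_nlink’ is 
2 plus its number of sub-directories; not every filesystem follows that 
convention.

```cpp
#include <dirent.h>    // opendir, readdir, closedir
#include <sys/stat.h>  // lstat, S_ISDIR

#include <algorithm>
#include <string>

// Hypothetical helper, not part of Mesos: returns the largest st_nlink
// found on `path` or on any directory beneath it. Returns 0 if `path`
// cannot be stat'ed or is not a directory. Symlinks are not followed
// because lstat() reports on the link itself.
long maxLinkCount(const std::string& path)
{
  struct stat s;
  if (::lstat(path.c_str(), &s) != 0 || !S_ISDIR(s.st_mode)) {
    return 0;
  }

  long result = static_cast<long>(s.st_nlink);

  DIR* dir = ::opendir(path.c_str());
  if (dir == nullptr) {
    return result;  // Unreadable directory: report its own count only.
  }

  while (struct dirent* entry = ::readdir(dir)) {
    const std::string name = entry->d_name;
    if (name == "." || name == "..") {
      continue;
    }
    result = std::max(result, maxLinkCount(path + "/" + name));
  }

  ::closedir(dir);
  return result;
}
```

The result could then be compared against the ‘_PC_LINK_MAX’ value for the 
same path to decide whether a scheduled directory should be collected earlier.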

References:
1. http://man7.org/linux/man-pages/man3/pathconf.3.html
2. https://github.com/apache/mesos/blob/master/src/slave/paths.hpp
3. http://linux.die.net/man/2/lstat

> Slave GarbageCollector needs to also take into account the number of links, 
> when determining removal time.
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-391
>                 URL: https://issues.apache.org/jira/browse/MESOS-391
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Ritwik Yadav
>              Labels: twitter
>
> The slave garbage collector does not take into account the number of links 
> present, which means that if we create a lot of executor directories (up to 
> LINK_MAX), we won't necessarily GC.
> As a result of this, the slave crashes:
> F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-1937777162-5050-38880-267/frameworks/201103282247-0000000019-0000/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links
> *** Check failure stack trace: ***
>     @     0x7f9320f82f9d  google::LogMessage::Fail()
>     @     0x7f9320f88c07  google::LogMessage::SendToLog()
>     @     0x7f9320f8484c  google::LogMessage::Flush()
>     @     0x7f9320f84ab6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f9320c70312  _CheckSome::~_CheckSome()
>     @     0x7f9320c9dd5c  mesos::internal::slave::paths::createExecutorDirectory()
>     @     0x7f9320c9e60d  mesos::internal::slave::Framework::createExecutor()
>     @     0x7f9320c7a7f7  mesos::internal::slave::Slave::runTask()
>     @     0x7f9320c9cb43  ProtobufProcess<>::handler4<>()
>     @     0x7f9320c8678b  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f9320c9d1ab  ProtobufProcess<>::visit()
>     @     0x7f9320e4c774  process::MessageEvent::visit()
>     @     0x7f9320e40a1d  process::ProcessManager::resume()
>     @     0x7f9320e41268  process::schedule()
>     @     0x7f932055973d  start_thread
>     @     0x7f931ef3df6d  clone
> The fix here is to take into account the number of links (st_nlink), when 
> determining whether we need to GC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
