[ 
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557020#comment-14557020
 ] 

Joris Van Remoortere commented on MESOS-2254:
---------------------------------------------

There are 2 issues here:
1) For a given usage call, we grab all running processes and parse the entire 
/proc/[pid]/stat file for each of them. This is very CPU intensive. We parse 
it because we want to compute the pid hierarchy for a subtree, and so we need 
to extract the parent pid. When an OS has many running processes, we end up 
doing a lot of extra parsing of process stat files that we don't care about.
2) Every update() call in the containerizer will call update() on the isolator 
for each container. This means if we are running N executors, then we actually 
end up doing the work in 1) N times.

We are parsing /proc/[pid]/stat for all running processes N times every M 
seconds, where M is the monitoring interval.

There are a couple of approaches to fixing this. One is to parse only what is 
necessary to build the subtree (the parent pid), and then parse the full 
information only for the processes we actually care about.
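
A rough sketch of that two-step idea, not the actual Mesos/stout code: the 
helper names parseParentPid() and subtree() are hypothetical, and the sketch 
assumes the Linux /proc/[pid]/stat layout of "pid (comm) state ppid ...". The 
first function pulls only the parent pid out of a stat line; the second 
computes the subtree from a pid -> ppid map, so that the full stat parsing 
could then be limited to the pids it returns.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Extracts only the parent pid (4th field) from the contents of a
// /proc/[pid]/stat file. The comm field (2nd) is wrapped in parentheses and
// may itself contain spaces or ')', so we look for the LAST ')' and read the
// state and ppid fields that follow it. Returns false on malformed input.
bool parseParentPid(const std::string& stat, int* ppid)
{
  const std::size_t close = stat.rfind(')');
  if (close == std::string::npos) {
    return false;
  }

  std::istringstream in(stat.substr(close + 1));
  char state;
  int parent;
  if (!(in >> state >> parent)) {
    return false;
  }

  *ppid = parent;
  return true;
}

// Given a pid -> ppid map, returns 'root' plus all of its descendants.
// 'root' is always included, even if it has no entry in the map.
std::set<int> subtree(const std::map<int, int>& parents, int root)
{
  // Invert the map into ppid -> children so each lookup is cheap.
  std::multimap<int, int> children;
  for (const auto& entry : parents) {
    children.emplace(entry.second, entry.first);
  }

  std::set<int> result;
  std::vector<int> pending{root};
  while (!pending.empty()) {
    const int pid = pending.back();
    pending.pop_back();

    // Skip pids we have already visited (guards against odd cycles).
    if (!result.insert(pid).second) {
      continue;
    }

    const auto range = children.equal_range(pid);
    for (auto it = range.first; it != range.second; ++it) {
      pending.push_back(it->second);
    }
  }

  return result;
}
```

With this split, the per-process cost of the first pass is a short scan for 
two fields rather than a parse of the whole stat line, and the expensive 
parsing is limited to the pids in the container's subtree.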

> Posix CPU isolator usage call introduce high cpu load
> -----------------------------------------------------
>
>                 Key: MESOS-2254
>                 URL: https://issues.apache.org/jira/browse/MESOS-2254
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we 
> have seen a very high cpu load (over 200%).
> From profiling one thread (there were two, which together took up all the 
> CPU time; the total CPU time was over 200%):
> {code}
> Running Time  Self            Symbol Name
> 27133.0ms   47.8%     0.0             _pthread_body  0x1adb50
> 27133.0ms   47.8%     0.0              thread_start
> 27133.0ms   47.8%     0.0               _pthread_start
> 27133.0ms   47.8%     0.0                _pthread_body
> 27133.0ms   47.8%     0.0                 process::schedule(void*)
> 27133.0ms   47.8%     2.0                  
> process::ProcessManager::resume(process::ProcessBase*)
> 27126.0ms   47.8%     1.0                   
> process::ProcessBase::serve(process::Event const&)
> 27125.0ms   47.8%     0.0                    
> process::DispatchEvent::visit(process::EventVisitor*) const
> 27125.0ms   47.8%     0.0                     
> process::ProcessBase::visit(process::DispatchEvent const&)
> 27125.0ms   47.8%     0.0                      std::__1::function<void 
> (process::ProcessBase*)>::operator()(process::ProcessBase*) const
> 27124.0ms   47.8%     0.0                       
> std::__1::__function::__func<process::Future<mesos::ResourceStatistics> 
> process::dispatch<mesos::ResourceStatistics, 
> mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, 
> mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> 
> const&, process::Future<mesos::ResourceStatistics> 
> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), 
> mesos::ContainerID)::'lambda'(process::ProcessBase*), 
> std::__1::allocator<process::Future<mesos::ResourceStatistics> 
> process::dispatch<mesos::ResourceStatistics, 
> mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, 
> mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> 
> const&, process::Future<mesos::ResourceStatistics> 
> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), 
> mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void 
> (process::ProcessBase*)>::operator()(process::ProcessBase*&&)
> 27124.0ms   47.8%     1.0                        
> process::Future<mesos::ResourceStatistics> 
> process::dispatch<mesos::ResourceStatistics, 
> mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&, 
> mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess> 
> const&, process::Future<mesos::ResourceStatistics> 
> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&), 
> mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*)
>  const
> 27060.0ms   47.7%     1.0                         
> mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID 
> const&)
> 27046.0ms   47.7%     2.0                          
> mesos::internal::usage(int, bool, bool)
> 27023.0ms   47.6%     2.0                           os::pstree(Option<int>)
> 26748.0ms   47.1%     23.0                           os::processes()
> 24809.0ms   43.7%     349.0                           os::process(int)
> 8199.0ms   14.4%      47.0                             os::sysctl::string() 
> const
> 7562.0ms   13.3%      7562.0                            __sysctl
> {code}
> We could see that usage() in usage/usage.cpp is causing this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
