[
https://issues.apache.org/jira/browse/MESOS-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557020#comment-14557020
]
Joris Van Remoortere commented on MESOS-2254:
---------------------------------------------
There are 2 issues here:
1) For a given usage call, we grab all running processes and parse the entire
/proc/[pid]/stat file for each of them. This is very CPU intensive. We parse
it because we want to compute the pid hierarchy for a subtree, and so we need
to extract the parent pid. When an OS has many running processes, we end up
doing a lot of extra parsing of stat files for processes we don't care about.
2) Every update() call in the containerizer will call update() on the isolator
for each container. This means if we are running N executors, then we actually
end up doing the work in 1) N times.
We are parsing /proc/[pid]/stat for all running processes N times every M
seconds, where M is the monitoring interval.
There are a couple of approaches to fixing this. One is to first parse only
what is necessary to build the subtree (the parent pid), and then parse the
full information only for the processes we actually care about.
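As a rough illustration of that first approach, here is a minimal sketch. It
is not the Mesos implementation: parsePpid() and subtree() are hypothetical
helper names, and the sketch assumes the Linux stat layout "pid (comm) state
ppid ...". The idea is to pull only the parent pid out of each stat line,
build the subtree from the resulting pid -> ppid map, and leave the full parse
for the pids that end up in the subtree.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <string>
#include <vector>

#include <sys/types.h>

// Hypothetical helper: extract only the parent pid from the contents of a
// /proc/[pid]/stat file. Returns -1 if the line cannot be parsed.
// The comm field may itself contain spaces or parentheses, so we search for
// the last ')' rather than splitting on whitespace.
pid_t parsePpid(const std::string& stat)
{
  const std::string::size_type close = stat.rfind(')');
  if (close == std::string::npos) {
    return -1;
  }

  std::string::size_type pos = close + 1;
  while (pos < stat.size() && stat[pos] == ' ') { ++pos; }
  if (pos >= stat.size()) {
    return -1;
  }
  ++pos; // Skip the single-character state field.
  while (pos < stat.size() && stat[pos] == ' ') { ++pos; }

  std::string::size_type end = pos;
  while (end < stat.size() &&
         std::isdigit(static_cast<unsigned char>(stat[end]))) {
    ++end;
  }
  if (end == pos) {
    return -1;
  }

  return static_cast<pid_t>(std::stol(stat.substr(pos, end - pos)));
}

// Hypothetical helper: given a pid -> ppid map for all running processes,
// return the root pid and all of its descendants. Returns an empty set if
// the root is not a known process.
std::set<pid_t> subtree(pid_t root, const std::map<pid_t, pid_t>& parents)
{
  std::set<pid_t> result;
  if (parents.count(root) == 0) {
    return result;
  }

  // Invert the map once so each child lookup is cheap.
  std::multimap<pid_t, pid_t> children;
  for (std::map<pid_t, pid_t>::const_iterator it = parents.begin();
       it != parents.end(); ++it) {
    children.insert(std::make_pair(it->second, it->first));
  }

  std::vector<pid_t> stack(1, root);
  while (!stack.empty()) {
    const pid_t pid = stack.back();
    stack.pop_back();
    if (!result.insert(pid).second) {
      continue; // Already visited.
    }
    typedef std::multimap<pid_t, pid_t>::const_iterator Iter;
    const std::pair<Iter, Iter> range = children.equal_range(pid);
    for (Iter it = range.first; it != range.second; ++it) {
      stack.push_back(it->second);
    }
  }

  return result;
}
```

Only the pids returned by subtree() would then need the full (expensive) stat
parse. This still reads one stat file per running process for each call, so it
reduces the cost described in 1) but does not address the N-fold repetition
described in 2).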
> Posix CPU isolator usage call introduce high cpu load
> -----------------------------------------------------
>
> Key: MESOS-2254
> URL: https://issues.apache.org/jira/browse/MESOS-2254
> Project: Mesos
> Issue Type: Bug
> Reporter: Niklas Quarfot Nielsen
>
> With more than 20 executors running on a slave with the posix isolator, we
> have seen a very high cpu load (over 200%).
> Profile of one thread (there were two, together taking up all the CPU
> time; total CPU usage was over 200%):
> {code}
> Running Time Self Symbol Name
> 27133.0ms 47.8% 0.0 _pthread_body 0x1adb50
> 27133.0ms 47.8% 0.0 thread_start
> 27133.0ms 47.8% 0.0 _pthread_start
> 27133.0ms 47.8% 0.0 _pthread_body
> 27133.0ms 47.8% 0.0 process::schedule(void*)
> 27133.0ms 47.8% 2.0 process::ProcessManager::resume(process::ProcessBase*)
> 27126.0ms 47.8% 1.0 process::ProcessBase::serve(process::Event const&)
> 27125.0ms 47.8% 0.0 process::DispatchEvent::visit(process::EventVisitor*) const
> 27125.0ms 47.8% 0.0 process::ProcessBase::visit(process::DispatchEvent const&)
> 27125.0ms 47.8% 0.0 std::__1::function<void
> (process::ProcessBase*)>::operator()(process::ProcessBase*) const
> 27124.0ms 47.8% 0.0
> std::__1::__function::__func<process::Future<mesos::ResourceStatistics>
> process::dispatch<mesos::ResourceStatistics,
> mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&,
> mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess>
> const&, process::Future<mesos::ResourceStatistics>
> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&),
> mesos::ContainerID)::'lambda'(process::ProcessBase*),
> std::__1::allocator<process::Future<mesos::ResourceStatistics>
> process::dispatch<mesos::ResourceStatistics,
> mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&,
> mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess>
> const&, process::Future<mesos::ResourceStatistics>
> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&),
> mesos::ContainerID)::'lambda'(process::ProcessBase*)>, void
> (process::ProcessBase*)>::operator()(process::ProcessBase*&&)
> 27124.0ms 47.8% 1.0
> process::Future<mesos::ResourceStatistics>
> process::dispatch<mesos::ResourceStatistics,
> mesos::internal::slave::IsolatorProcess, mesos::ContainerID const&,
> mesos::ContainerID>(process::PID<mesos::internal::slave::IsolatorProcess>
> const&, process::Future<mesos::ResourceStatistics>
> (mesos::internal::slave::IsolatorProcess::*)(mesos::ContainerID const&),
> mesos::ContainerID)::'lambda'(process::ProcessBase*)::operator()(process::ProcessBase*)
> const
> 27060.0ms 47.7% 1.0 mesos::internal::slave::PosixCpuIsolatorProcess::usage(mesos::ContainerID const&)
> 27046.0ms 47.7% 2.0 mesos::internal::usage(int, bool, bool)
> 27023.0ms 47.6% 2.0 os::pstree(Option<int>)
> 26748.0ms 47.1% 23.0 os::processes()
> 24809.0ms 43.7% 349.0 os::process(int)
> 8199.0ms 14.4% 47.0 os::sysctl::string() const
> 7562.0ms 13.3% 7562.0 __sysctl
> {code}
> We could see that usage() in usage/usage.cpp is causing this.