[ 
https://issues.apache.org/jira/browse/MESOS-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545012#comment-14545012
 ] 

chenqiuhao commented on MESOS-2706:
-----------------------------------

Hi Tim, I updated Mesos to 0.22.1 today and reproduced the issue as well.
I noticed that as the number of my tasks grew, the CPU usage of the
'lt-mesos-slave' process increased and stayed pinned at 100% (one full CPU),
as shown below.

========Top result========
top - 14:12:17 up 48 days, 16:41,  8 users,  load average: 1.13, 1.36, 2.00
Tasks: 1717 total,   1 running, 1716 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.2 us,  8.2 sy,  0.0 ni, 83.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32782296 total, 17602948 used, 15179348 free,   193112 buffers
KiB Swap: 20479996 total,        0 used, 20479996 free. 10514872 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
271790 root      20   0  952028  54868  14192 S 100.6  0.2 163:26.57 lt-mesos-slave
180453 root      20   0 12.791g 955736  26216 S   4.2  2.9  11:12.81 java
……

====================

Then I ran top -Hp 271790 to find which thread of lt-mesos-slave was using
the most CPU; it was thread 271808:

========top -Hp 271790 result========
top - 14:16:20 up 48 days, 16:45,  8 users,  load average: 1.36, 1.35, 1.84
Threads:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.2 us,  8.0 sy,  0.0 ni, 83.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32782296 total, 17599804 used, 15182492 free,   193112 buffers
KiB Swap: 20479996 total,        0 used, 20479996 free. 10515348 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
271808 root      20   0  952028  54868  14192 R 99.9  0.2 165:04.18 lt-mesos-slave
271807 root      20   0  952028  54868  14192 S  0.7  0.2   0:21.79 lt-mesos-slave
271810 root      20   0  952028  54868  14192 S  0.3  0.2   0:16.86 lt-mesos-slave
271813 root      20   0  952028  54868  14192 S  0.3  0.2   0:19.85 lt-mesos-slave
271814 root      20   0  952028  54868  14192 S  0.3  0.2   0:23.27 lt-mesos-slave
271815 root      20   0  952028  54868  14192 S  0.3  0.2   0:06.96 lt-mesos-slave
271790 root      20   0  952028  54868  14192 S  0.0  0.2   0:00.05 lt-mesos-slave
271809 root      20   0  952028  54868  14192 S  0.0  0.2   0:17.30 lt-mesos-slave
271811 root      20   0  952028  54868  14192 S  0.0  0.2   0:22.63 lt-mesos-slave
271812 root      20   0  952028  54868  14192 S  0.0  0.2   0:16.02 lt-mesos-slave
271820 root      20   0  952028  54868  14192 S  0.0  0.2   0:00.53 lt-mesos-slave
271821 root      20   0  952028  54868  14192 S  0.0  0.2   0:00.00 lt-mesos-slave

=======================
I then ran pstack several times to see what thread 271808 was doing, and
found:

=========1st time pstack 271808 =============== 
Thread 1 (process 271808):
#0  0x00007fcb05f5725d in read () from /lib64/libpthread.so.0
#1  0x00007fcb05cbe567 in std::__basic_file<char>::xsgetn(char*, long) () from 
/lib64/libstdc++.so.6
#2  0x00007fcb05cf7c50 in std::basic_filebuf<char, std::char_traits<char> 
>::underflow() () from /lib64/libstdc++.so.6
#3  0x00007fcb05cc16e9 in std::istream::get(std::basic_streambuf<char, 
std::char_traits<char> >&, char) () from /lib64/libstdc++.so.6
#4  0x00007fcb06837205 in proc::cmdline(Option<int> const&) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#5  0x00007fcb06837ec8 in os::process(int) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#6  0x00007fcb06b5fd1c in mesos::internal::usage(int, bool, bool) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#7  0x00007fcb06a97eb0 in 
mesos::internal::slave::DockerContainerizerProcess::__usage(mesos::ContainerID 
const&, int) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#8  0x00007fcb06a9817d in 
mesos::internal::slave::DockerContainerizerProcess::usage(mesos::ContainerID 
const&) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#9  0x00007fcb06aac623 in std::_Function_handler<void (process::ProcessBase*), 
process::Future<mesos::ResourceStatistics> 
process::dispatch<mesos::ResourceStatistics, 
mesos::internal::slave::DockerContainerizerProcess, mesos::ContainerID const&, 
mesos::ContainerID>(process::PID<mesos::internal::slave::DockerContainerizerProcess>
 const&, process::Future<mesos::ResourceStatistics> 
(mesos::internal::slave::DockerContainerizerProcess::*)(mesos::ContainerID 
const&), 
mesos::ContainerID)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data
 const&, process::ProcessBase*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#10 0x00007fcb06dca62a in 
process::ProcessManager::resume(process::ProcessBase*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#11 0x00007fcb06dca8ec in process::schedule(void*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#12 0x00007fcb05f50df3 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fcb0545f3dd in clone () from /lib64/libc.so.6

=========2nd time  pstack 271808 ==========
Thread 1 (process 271808):
#0  0x00007fcb05450a3d in open64 () from /lib64/libc.so.6
#1  0x00007fcb053e1160 in __GI__IO_file_fopen () from /lib64/libc.so.6
#2  0x00007fcb053d5914 in __fopen_internal () from /lib64/libc.so.6
#3  0x00000000004204cf in os::read(std::string const&) ()
#4  0x00007fcb06835cc7 in proc::status(int) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#5  0x00007fcb06837b71 in os::process(int) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#6  0x00007fcb06b5fd1c in mesos::internal::usage(int, bool, bool) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#7  0x00007fcb06a97eb0 in 
mesos::internal::slave::DockerContainerizerProcess::__usage(mesos::ContainerID 
const&, int) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#8  0x00007fcb06a9817d in 
mesos::internal::slave::DockerContainerizerProcess::usage(mesos::ContainerID 
const&) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#9  0x00007fcb06aac623 in std::_Function_handler<void (process::ProcessBase*), 
process::Future<mesos::ResourceStatistics> 
process::dispatch<mesos::ResourceStatistics, 
mesos::internal::slave::DockerContainerizerProcess, mesos::ContainerID const&, 
mesos::ContainerID>(process::PID<mesos::internal::slave::DockerContainerizerProcess>
 const&, process::Future<mesos::ResourceStatistics> 
(mesos::internal::slave::DockerContainerizerProcess::*)(mesos::ContainerID 
const&), 
mesos::ContainerID)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data
 const&, process::ProcessBase*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#10 0x00007fcb06dca62a in 
process::ProcessManager::resume(process::ProcessBase*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#11 0x00007fcb06dca8ec in process::schedule(void*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#12 0x00007fcb05f50df3 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fcb0545f3dd in clone () from /lib64/libc.so.6


===========3rd time pstack 271808=========================
Thread 1 (process 271808):
#0  0x00007fcb05459d57 in munmap () from /lib64/libc.so.6
#1  0x00007fcb053e26c2 in __GI__IO_setb () from /lib64/libc.so.6
#2  0x00007fcb053e0e90 in __GI__IO_file_close_it () from /lib64/libc.so.6
#3  0x00007fcb053d4f68 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#4  0x0000000000420560 in os::read(std::string const&) ()
#5  0x00007fcb06835cc7 in proc::status(int) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#6  0x00007fcb06837b71 in os::process(int) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#7  0x00007fcb06b5fd1c in mesos::internal::usage(int, bool, bool) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#8  0x00007fcb06a97eb0 in 
mesos::internal::slave::DockerContainerizerProcess::__usage(mesos::ContainerID 
const&, int) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#9  0x00007fcb06a9817d in 
mesos::internal::slave::DockerContainerizerProcess::usage(mesos::ContainerID 
const&) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#10 0x00007fcb06aac623 in std::_Function_handler<void (process::ProcessBase*), 
process::Future<mesos::ResourceStatistics> 
process::dispatch<mesos::ResourceStatistics, 
mesos::internal::slave::DockerContainerizerProcess, mesos::ContainerID const&, 
mesos::ContainerID>(process::PID<mesos::internal::slave::DockerContainerizerProcess>
 const&, process::Future<mesos::ResourceStatistics> 
(mesos::internal::slave::DockerContainerizerProcess::*)(mesos::ContainerID 
const&), 
mesos::ContainerID)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data
 const&, process::ProcessBase*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#11 0x00007fcb06dca62a in 
process::ProcessManager::resume(process::ProcessBase*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#12 0x00007fcb06dca8ec in process::schedule(void*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#13 0x00007fcb05f50df3 in start_thread () from /lib64/libpthread.so.0
#14 0x00007fcb0545f3dd in clone () from /lib64/libc.so.6

=========================
The results show that thread 271808 was busy doing the usage collection:

#7  0x00007fcb06b5fd1c in mesos::internal::usage(int, bool, bool) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#8  0x00007fcb06a97eb0 in 
mesos::internal::slave::DockerContainerizerProcess::__usage(mesos::ContainerID 
const&, int) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#9  0x00007fcb06a9817d in 
mesos::internal::slave::DockerContainerizerProcess::usage(mesos::ContainerID 
const&) () from /wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
#10 0x00007fcb06aac623 in std::_Function_handler<void (process::ProcessBase*), 
process::Future<mesos::ResourceStatistics> 
process::dispatch<mesos::ResourceStatistics, 
mesos::internal::slave::DockerContainerizerProcess, mesos::ContainerID const&, 
mesos::ContainerID>(process::PID<mesos::internal::slave::DockerContainerizerProcess>
 const&, process::Future<mesos::ResourceStatistics> 
(mesos::internal::slave::DockerContainerizerProcess::*)(mesos::ContainerID 
const&), 
mesos::ContainerID)::{lambda(process::ProcessBase*)#1}>::_M_invoke(std::_Any_data
 const&, process::ProcessBase*) () from 
/wls/mesos-0.22.1/build/src/.libs/libmesos-0.22.1.so
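
For reference, these frames suggest that each usage() call walks /proc:
proc::cmdline() and proc::status() correspond to reading /proc/<pid>/cmdline
and /proc/<pid>/status, and os::process() appears to be called for many pids
to build the process tree of one container. Below is a minimal standalone
sketch of that access pattern (the helper names are mine, not the actual
Mesos code), just to show how many small file reads one sweep generates:

=========sketch: per-pid /proc reads (not the actual Mesos code)=========
#include <dirent.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Roughly what proc::cmdline() does: read /proc/<pid>/cmdline.
static std::string readCmdline(int pid) {
  std::ifstream in("/proc/" + std::to_string(pid) + "/cmdline");
  return std::string((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
}

// Roughly what proc::status() does: read /proc/<pid>/status.
static std::string readStatus(int pid) {
  std::ifstream in("/proc/" + std::to_string(pid) + "/status");
  return std::string((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
}

int main() {
  // One "sweep": touch cmdline and status for every pid in /proc.
  // With ~1700 processes on the box (see the top output above) that is
  // thousands of small file opens/reads, all done by one thread.
  DIR* dir = opendir("/proc");
  if (dir == nullptr) {
    return 1;
  }

  size_t reads = 0;
  for (struct dirent* entry = readdir(dir); entry != nullptr;
       entry = readdir(dir)) {
    const std::string name = entry->d_name;
    if (name.find_first_not_of("0123456789") != std::string::npos) {
      continue;  // skip non-pid entries such as "self" or "cpuinfo"
    }
    readCmdline(std::stoi(name));
    readStatus(std::stoi(name));
    reads += 2;
  }
  closedir(dir);

  std::cout << reads << " /proc file reads in one sweep" << std::endl;
  return 0;
}
====================

If one such sweep happens for every container's usage() dispatch, the /proc
traffic per collection interval grows with (number of containers) x (number
of processes on the host), which would explain why thread 271808 is pegged
at 100%.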


I suppose lt-mesos-slave is a single process with multiple threads. The
bottleneck is that this usage work effectively uses only one CPU: the usage
thread is too busy to keep up, possibly due to some resource-race bug as
well.
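
As a back-of-envelope illustration of that saturation (the per-sweep cost
and the collection interval below are assumed numbers, not measurements):
once the serialized usage() work per interval exceeds the interval itself,
the busy thread can no longer keep up and later dispatches wait longer in
the queue, which could also be why the gap between 'Queuing task' and
'Starting container' grows.

=========sketch: rough saturation estimate (assumed numbers)=========
#include <iostream>

int main() {
  // Assumed numbers for illustration only.
  const double intervalSec = 1.0;            // assumed collection interval
  const double sweepSecPerContainer = 0.02;  // assumed cost of one /proc sweep

  for (int containers : {10, 50, 100, 200}) {
    const double load = containers * sweepSecPerContainer / intervalSec;
    std::cout << containers << " containers -> usage thread ~"
              << load * 100.0 << "% busy"
              << (load >= 1.0 ? " (saturated, later dispatches queue up)" : "")
              << std::endl;
  }
  return 0;
}
====================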



> When the docker-tasks grow, the time spare between Queuing task and Starting 
> container grows
> --------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2706
>                 URL: https://issues.apache.org/jira/browse/MESOS-2706
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 0.22.0
>         Environment: My Environment info:
> Mesos 0.22.0 & Marathon 0.82-RC1, both running on one host-server.
> Every docker task requires 0.02 CPU and 128MB of memory, and the server has
> 8 CPUs and 24GB of memory.
> So Mesos can launch thousands of tasks in theory.
> The docker task is very lightweight: it just launches an sshd service.
>            Reporter: chenqiuhao
>
> At the beginning, Marathon launched docker tasks very fast, but when the
> number of tasks on the single mesos-slave host reached 50, Marathon seemed
> to launch docker tasks slowly.
> So I checked the mesos-slave log and found that the gap between Queuing
> task and Starting container grew.
> For example,
> launching the 1st docker task takes about 0.008s:
> [root@CNSH231434 mesos-slave]# tail -f slave.out |egrep 'Queuing 
> task|Starting container'
> I0508 15:54:00.188350 225779 slave.cpp:1378] Queuing task 
> 'dev-rhel-sf.631d454d-f557-11e4-b4f4-628e0a30542b' for executor 
> dev-rhel-sf.631d454d-f557-11e4-b4f4-628e0a30542b of framework 
> '20150202-112355-2684495626-5050-26153-0000
> I0508 15:54:00.196832 225781 docker.cpp:581] Starting container 
> 'd0b0813a-6cb6-4dfd-bbce-f1b338744285' for task 
> 'dev-rhel-sf.631d454d-f557-11e4-b4f4-628e0a30542b' (and executor 
> 'dev-rhel-sf.631d454d-f557-11e4-b4f4-628e0a30542b') of framework 
> '20150202-112355-2684495626-5050-26153-0000'
> Launching the 50th docker task takes about 4.9s:
> I0508 16:12:10.908596 225781 slave.cpp:1378] Queuing task 
> 'dev-rhel-sf.ed3a6922-f559-11e4-ae87-628e0a30542b' for executor 
> dev-rhel-sf.ed3a6922-f559-11e4-ae87-628e0a30542b of framework 
> '20150202-112355-2684495626-5050-26153-0000
> I0508 16:12:15.801503 225778 docker.cpp:581] Starting container 
> '482dd47f-b9ab-4b09-b89e-e361d6f004a4' for task 
> 'dev-rhel-sf.ed3a6922-f559-11e4-ae87-628e0a30542b' (and executor 
> 'dev-rhel-sf.ed3a6922-f559-11e4-ae87-628e0a30542b') of framework 
> '20150202-112355-2684495626-5050-26153-0000'
> And when I launch the 100th docker task, it takes about 13s!
> I did the same test on a host with 24 CPUs and 256GB of memory and got the
> same result.
> Has anybody had the same experience, or can anybody help run the same
> pressure test?


