[ https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193774#comment-15193774 ]

Anthony Scalisi commented on MESOS-4869:
----------------------------------------

What do you mean? Without Mesos doing the health checks, here is a host with 
6 tasks, for example:

{noformat}
scalp@mesos-slave-i-d00b6017 $ free -m
             total       used       free     shared    buffers     cached
Mem:         16047      15306        740          0       3174       2547
-/+ buffers/cache:       9583       6463
Swap:            0          0          0


root@mesos-slave-i-d00b6017 # docker stats --no-stream
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O               BLOCK I/O
33cb349404e1        3.23%               897.8 MB / 1.611 GB   55.74%              4.859 GB / 4.625 GB   53.25 kB / 61.44 kB
61eba49cf71d        3.22%               1.166 GB / 1.611 GB   72.41%              5.49 GB / 5.155 GB    106.5 kB / 118.8 kB
630739e12032        3.76%               1.163 GB / 1.611 GB   72.22%              3.891 GB / 3.657 GB   348.2 kB / 118.8 kB
b5b9da9facfb        2.84%               901.9 MB / 1.611 GB   55.99%              2.254 GB / 2.153 GB   0 B / 118.8 kB
dcd2a73f71a9        3.55%               1.29 GB / 1.611 GB    80.10%              2.726 GB / 2.672 GB   0 B / 118.8 kB
de923d88a781        3.17%               889.5 MB / 1.611 GB   55.23%              3.817 GB / 3.645 GB   36.86 kB / 61.44 kB
{noformat}

Or another with 11 tasks:

{noformat}
root@mesos-slave-i-0fe036d7 # free -m
             total       used       free     shared    buffers     cached
Mem:         16047      15189        857          0       1347        688
-/+ buffers/cache:      13153       2893
Swap:            0          0          0

root@mesos-slave-i-0fe036d7 # docker stats --no-stream
CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O               BLOCK I/O
1527ccec3562        0.39%               46.75 MB / 134.2 MB   34.83%              318.5 MB / 283.5 MB   634.9 kB / 0 B
16c0afe372f1        3.12%               1.139 GB / 1.611 GB   70.69%              5.443 GB / 5.139 GB   1.757 MB / 118.8 kB
2aaac6a34f3b        3.50%               1.34 GB / 1.611 GB    83.18%              9.928 GB / 9.006 GB   2.646 MB / 118.8 kB
4bda58242e66        2.57%               875.5 MB / 1.611 GB   54.36%              4.853 GB / 4.632 GB   135.2 kB / 61.44 kB
67ed575e6f44        2.14%               1.171 GB / 1.611 GB   72.73%              3.878 GB / 3.664 GB   4.739 MB / 118.8 kB
87010c4fa547        4.23%               1.208 GB / 1.611 GB   74.99%              313.5 MB / 419.1 MB   213 kB / 94.21 kB
8ca7c160b196        1.73%               730.4 MB / 1.611 GB   45.35%              305.6 MB / 447.7 MB   0 B / 61.44 kB
cbac44b2663c        4.66%               1.088 GB / 1.611 GB   67.53%              16.48 GB / 14.91 GB   262.1 kB / 61.44 kB
d0fe165aecac        3.02%               901.2 MB / 1.611 GB   55.95%              1.573 GB / 1.555 GB   106.5 kB / 61.44 kB
df668f59a149        3.57%               1.143 GB / 1.611 GB   70.98%              2.732 GB / 2.681 GB   1.888 MB / 118.8 kB
e0fc97fa33cf        3.43%               1.034 GB / 1.611 GB   64.21%              3.823 GB / 3.655 GB   2.433 MB / 61.44 kB
{noformat}
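To quantify how much of a host's memory the health-check processes themselves hold, their resident set sizes can be summed with a one-liner like this (a sketch; it measures RSS via `ps -eo rss`, which is not the same as the `size` column used later in the report, and on hosts like the two above where Mesos is not doing the checks it should total zero):

```shell
# Total the resident memory (RSS, in MiB) held by mesos-health-check
# processes. The bracket trick in the pattern keeps grep itself out of
# the matches; ps reports RSS in KiB, so divide by 1024 for MiB.
ps -eo rss,command | grep '[m]esos-health-check' \
  | awk '{ total += $1 } END { printf "%.2f MiB total\n", total / 1024 }'
```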

> /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
> -------------------------------------------------------------------
>
>                 Key: MESOS-4869
>                 URL: https://issues.apache.org/jira/browse/MESOS-4869
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.1
>            Reporter: Anthony Scalisi
>            Priority: Critical
>
> We switched our health checks in Marathon from HTTP to COMMAND:
> {noformat}
> "healthChecks": [
>     {
>       "protocol": "COMMAND",
>       "path": "/ops/ping",
>       "command": { "value": "curl --silent -f -X GET 
> http://$HOST:$PORT0/ops/ping > /dev/null" },
>       "gracePeriodSeconds": 90,
>       "intervalSeconds": 2,
>       "portIndex": 0,
>       "timeoutSeconds": 5,
>       "maxConsecutiveFailures": 3
>     }
>   ]
> {noformat}
> All our applications have the same health check (and /ops/ping endpoint).
> Even though we have the issue on all our Mesos slaves, I'm going to focus on 
> a particular one: *mesos-slave-i-e3a9c724*.
> The slave has 16 gigs of memory, with about 12 gigs allocated for 8 tasks:
> !https://i.imgur.com/gbRf804.png!
> Here is a *docker ps* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker ps
> CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                     NAMES
> 4f7c0aa8d03a        java:8              "/bin/sh -c 'JAVA_OPT"   6 hours ago         Up 6 hours          0.0.0.0:31926->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
> 66f2fc8f8056        java:8              "/bin/sh -c 'JAVA_OPT"   6 hours ago         Up 6 hours          0.0.0.0:31939->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
> f7382f241fce        java:8              "/bin/sh -c 'JAVA_OPT"   6 hours ago         Up 6 hours          0.0.0.0:31656->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
> 880934c0049e        java:8              "/bin/sh -c 'JAVA_OPT"   24 hours ago        Up 24 hours         0.0.0.0:31371->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
> 5eab1f8dac4a        java:8              "/bin/sh -c 'JAVA_OPT"   46 hours ago        Up 46 hours         0.0.0.0:31500->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5ac75198-283f-4349-a220-9e9645b313e7
> b63740fe56e7        java:8              "/bin/sh -c 'JAVA_OPT"   46 hours ago        Up 46 hours         0.0.0.0:31382->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
> 5c7a9ea77b0e        java:8              "/bin/sh -c 'JAVA_OPT"   2 days ago          Up 2 days           0.0.0.0:31186->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
> 53065e7a31ad        java:8              "/bin/sh -c 'JAVA_OPT"   2 days ago          Up 2 days           0.0.0.0:31839->8080/tcp   mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
> {noformat}
> Here is a *docker stats* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724  # docker stats
> CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O               BLOCK I/O
> 4f7c0aa8d03a        2.93%               797.3 MB / 1.611 GB   49.50%              1.277 GB / 1.189 GB   155.6 kB / 151.6 kB
> 53065e7a31ad        8.30%               738.9 MB / 1.611 GB   45.88%              419.6 MB / 554.3 MB   98.3 kB / 61.44 kB
> 5c7a9ea77b0e        4.91%               1.081 GB / 1.611 GB   67.10%              423 MB / 526.5 MB     3.219 MB / 61.44 kB
> 5eab1f8dac4a        3.13%               1.007 GB / 1.611 GB   62.53%              2.737 GB / 2.564 GB   6.566 MB / 118.8 kB
> 66f2fc8f8056        3.15%               768.1 MB / 1.611 GB   47.69%              258.5 MB / 252.8 MB   1.86 MB / 151.6 kB
> 880934c0049e        10.07%              735.1 MB / 1.611 GB   45.64%              1.451 GB / 1.399 GB   573.4 kB / 94.21 kB
> b63740fe56e7        12.04%              629 MB / 1.611 GB     39.06%              10.29 GB / 9.344 GB   8.102 MB / 61.44 kB
> f7382f241fce        6.21%               505 MB / 1.611 GB     31.36%              153.4 MB / 151.9 MB   5.837 MB / 94.21 kB
> {noformat}
> Not much else is running on the slave, yet the used memory doesn't match the 
> tasks' memory:
> {noformat}
> Mem:16047M used:13340M buffers:1139M cache:776M
> {noformat}
> If I exec into a container (*java:8* image), I can see the shell calls that 
> execute the curl specified in the health check run and exit as expected.
> The only change that coincided with the memory issues was moving the health 
> checks from Marathon to Mesos, so I decided to take a look:
> {noformat}
> root@mesos-slave-i-e3a9c724 # ps awwx | grep health_check | grep -v grep
>  2504 ?        Sl    47:33 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:53432 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_email-green.b086206a-e000-11e5-a617-02429957d388
>  4220 ?        Sl    47:26 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:54982 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_chat-green.ed53ec41-e000-11e5-a617-02429957d388
>  7444 ?        Sl     1:31 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:59422 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_identity-green.aeb2ef3b-e219-11e5-a617-02429957d388
> 10368 ?        Sl     1:30 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:40981 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_channel-green.c6fbd2ac-e219-11e5-a617-02429957d388
> 12399 ?        Sl     9:45 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:44815 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_integration-green.143865d5-e17d-11e5-a617-02429957d388
> 13538 ?        Sl    24:54 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:56598 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_metric-green.75296986-e0c7-11e5-a617-02429957d388
> 32034 ?        Sl     1:31 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:48119 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_push-green.601337e6-e219-11e5-a617-02429957d388
> {noformat}
> The memory usage is really bad:
> {noformat}
> root@mesos-slave-i-e3a9c724 # ps -eo size,pid,user,command --sort -size | 
> grep health_check | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 
> ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'
>       2185.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:53432 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_email-green.b086206a-e000-11e5-a617-02429957d388
>       2185.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:54982 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_chat-green.ed53ec41-e000-11e5-a617-02429957d388
>       1673.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:56598 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_metric-green.75296986-e0c7-11e5-a617-02429957d388
>       1161.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:44815 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_integration-green.143865d5-e17d-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:59422 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_identity-green.aeb2ef3b-e219-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:40981 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_channel-green.c6fbd2ac-e219-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:48119 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_push-green.601337e6-e219-11e5-a617-02429957d388
>          0.32 Mb grep --color=auto health_check
> {noformat}
> Killing the *mesos-health-check* process for each container fixes our memory 
> issues (though I assume the health checks will no longer be reported or 
> something):
> {noformat}
> root@mesos-slave-i-e3a9c724 # date ; free -m ; ps awwx | grep health_check | 
> grep -v grep | awk '{print $1}' | xargs -I% -P1 kill % ; date ; free -m
> Fri Mar  4 21:20:55 UTC 2016
>              total       used       free     shared    buffers     cached
> Mem:         16047      13538       2508          0       1140        774
> -/+ buffers/cache:      11623       4423
> Swap:            0          0          0
> Fri Mar  4 21:20:56 UTC 2016
>              total       used       free     shared    buffers     cached
> Mem:         16047       9101       6945          0       1140        774
> -/+ buffers/cache:       7186       8860
> Swap:            0          0          0
> {noformat}
> We're reverting to Marathon doing the health checks for now, but I'd like to 
> emphasize that this is happening across all our slaves (not an isolated 
> issue).
> Thanks for looking into it :)
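
As an aside, the per-process kill pipeline in the quoted report can be condensed with pkill (a sketch; the same caveat applies that killing the checkers presumably stops health reporting for those tasks until they are restarted):

```shell
# Kill every mesos-health-check process by matching the full command
# line (-f). Equivalent to the ps | grep | awk | xargs kill pipeline
# in the report, but in one command.
pkill -f '/usr/libexec/mesos/mesos-health-check'
```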



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
