[ 
https://issues.apache.org/jira/browse/MESOS-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-4869:
----------------------------
    Labels: health-check  (was: )

> /usr/libexec/mesos/mesos-health-check using/leaking a lot of memory
> -------------------------------------------------------------------
>
>                 Key: MESOS-4869
>                 URL: https://issues.apache.org/jira/browse/MESOS-4869
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.1
>            Reporter: Anthony Scalisi
>            Priority: Critical
>              Labels: health-check
>
> We switched our health checks in Marathon from HTTP to COMMAND:
> {noformat}
> "healthChecks": [
>     {
>       "protocol": "COMMAND",
>       "path": "/ops/ping",
>       "command": { "value": "curl --silent -f -X GET http://$HOST:$PORT0/ops/ping > /dev/null" },
>       "gracePeriodSeconds": 90,
>       "intervalSeconds": 2,
>       "portIndex": 0,
>       "timeoutSeconds": 5,
>       "maxConsecutiveFailures": 3
>     }
>   ]
> {noformat}
> All our applications have the same health check (and /ops/ping endpoint).
> Even though we have the issue on all our Mesos slaves, I'm going to focus on a 
> particular one: *mesos-slave-i-e3a9c724*.
> The slave has 16 gigs of memory, with about 12 gigs allocated for 8 tasks:
> !https://i.imgur.com/gbRf804.png!
> Here is a *docker ps* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724 # docker ps
> CONTAINER ID        IMAGE               COMMAND                  CREATED      
>        STATUS              PORTS                     NAMES
> 4f7c0aa8d03a        java:8              "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>        Up 6 hours          0.0.0.0:31926->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
> 66f2fc8f8056        java:8              "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>        Up 6 hours          0.0.0.0:31939->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
> f7382f241fce        java:8              "/bin/sh -c 'JAVA_OPT"   6 hours ago  
>        Up 6 hours          0.0.0.0:31656->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
> 880934c0049e        java:8              "/bin/sh -c 'JAVA_OPT"   24 hours ago 
>        Up 24 hours         0.0.0.0:31371->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
> 5eab1f8dac4a        java:8              "/bin/sh -c 'JAVA_OPT"   46 hours ago 
>        Up 46 hours         0.0.0.0:31500->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5ac75198-283f-4349-a220-9e9645b313e7
> b63740fe56e7        java:8              "/bin/sh -c 'JAVA_OPT"   46 hours ago 
>        Up 46 hours         0.0.0.0:31382->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
> 5c7a9ea77b0e        java:8              "/bin/sh -c 'JAVA_OPT"   2 days ago   
>        Up 2 days           0.0.0.0:31186->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
> 53065e7a31ad        java:8              "/bin/sh -c 'JAVA_OPT"   2 days ago   
>        Up 2 days           0.0.0.0:31839->8080/tcp   
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
> {noformat}
> Here is a *docker stats* on it:
> {noformat}
> root@mesos-slave-i-e3a9c724  # docker stats
> CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %           
>     NET I/O               BLOCK I/O
> 4f7c0aa8d03a        2.93%               797.3 MB / 1.611 GB   49.50%          
>     1.277 GB / 1.189 GB   155.6 kB / 151.6 kB
> 53065e7a31ad        8.30%               738.9 MB / 1.611 GB   45.88%          
>     419.6 MB / 554.3 MB   98.3 kB / 61.44 kB
> 5c7a9ea77b0e        4.91%               1.081 GB / 1.611 GB   67.10%          
>     423 MB / 526.5 MB     3.219 MB / 61.44 kB
> 5eab1f8dac4a        3.13%               1.007 GB / 1.611 GB   62.53%          
>     2.737 GB / 2.564 GB   6.566 MB / 118.8 kB
> 66f2fc8f8056        3.15%               768.1 MB / 1.611 GB   47.69%          
>     258.5 MB / 252.8 MB   1.86 MB / 151.6 kB
> 880934c0049e        10.07%              735.1 MB / 1.611 GB   45.64%          
>     1.451 GB / 1.399 GB   573.4 kB / 94.21 kB
> b63740fe56e7        12.04%              629 MB / 1.611 GB     39.06%          
>     10.29 GB / 9.344 GB   8.102 MB / 61.44 kB
> f7382f241fce        6.21%               505 MB / 1.611 GB     31.36%          
>     153.4 MB / 151.9 MB   5.837 MB / 94.21 kB
> {noformat}
> Not much else is running on the slave, yet the host's used memory doesn't add 
> up to the tasks' memory:
> {noformat}
> Mem:16047M used:13340M buffers:1139M cache:776M
> {noformat}
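For a concrete number, summing the MEM USAGE column from the *docker stats* output above (values in MB, copied from that output) shows the containers account for only about 6.3 GB:

```shell
# Sum the per-container MEM USAGE figures from the docker stats output above
# (values in MB, copied from the report):
echo "797.3 738.9 1081 1007 768.1 735.1 629 505" |
  awk '{ for (i = 1; i <= NF; i++) s += $i; printf "%.1f MB total\n", s }'
```

So the containers account for roughly 6.3 GB, while the host reports about 11.4 GB used even after subtracting buffers and cache (13340 - 1139 - 776 MB).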
> If I exec into a container (*java:8* image), I can see the shell invocations 
> of the curl command specified in the health check running and exiting as 
> expected.
> The only change that coincided with the memory-usage woes was moving the 
> health checks from Marathon to Mesos, so I decided to take a look:
> {noformat}
> root@mesos-slave-i-e3a9c724 # ps awwx | grep health_check | grep -v grep
>  2504 ?        Sl    47:33 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:53432 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_email-green.b086206a-e000-11e5-a617-02429957d388
>  4220 ?        Sl    47:26 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:54982 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_chat-green.ed53ec41-e000-11e5-a617-02429957d388
>  7444 ?        Sl     1:31 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:59422 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_identity-green.aeb2ef3b-e219-11e5-a617-02429957d388
> 10368 ?        Sl     1:30 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:40981 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_channel-green.c6fbd2ac-e219-11e5-a617-02429957d388
> 12399 ?        Sl     9:45 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:44815 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_integration-green.143865d5-e17d-11e5-a617-02429957d388
> 13538 ?        Sl    24:54 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:56598 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_metric-green.75296986-e0c7-11e5-a617-02429957d388
> 32034 ?        Sl     1:31 /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:48119 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_push-green.601337e6-e219-11e5-a617-02429957d388
> {noformat}
> The memory usage of these checker processes is really bad:
> {noformat}
> root@mesos-slave-i-e3a9c724 # ps -eo size,pid,user,command --sort -size | 
> grep health_check | awk '{ hr=$1/1024 ; printf("%13.2f Mb ",hr) } { for ( x=4 
> ; x<=NF ; x++ ) { printf("%s ",$x) } print "" }'
>       2185.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:53432 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.f0a3f4c5-ecdb-4f97-bede-d744feda670c
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_email-green.b086206a-e000-11e5-a617-02429957d388
>       2185.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:54982 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.b05043c5-44fc-40bf-aea2-10354e8f5ab4
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_chat-green.ed53ec41-e000-11e5-a617-02429957d388
>       1673.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:56598 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.5d417f16-df24-49d5-a5b0-38a7966460fe
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_metric-green.75296986-e0c7-11e5-a617-02429957d388
>       1161.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:44815 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.23dfe408-ab8f-40be-bf6f-ce27fe885ee0
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_integration-green.143865d5-e17d-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:59422 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.60972150-b2b1-45d8-8a55-d63e81b8372a
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_identity-green.aeb2ef3b-e219-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:40981 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.3dbb1004-5bb8-432f-8fd8-b863bd29341d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_channel-green.c6fbd2ac-e219-11e5-a617-02429957d388
>        649.39 Mb /usr/libexec/mesos/mesos-health-check 
> --executor=(1)@10.92.32.63:48119 
> --health_check_json={"command":{"shell":true,"value":"docker exec 
> mesos-29e183be-f611-41b4-824c-2d05b052231b-S6.39731a2f-d29e-48d1-9927-34ab8c5f557d
>  sh -c \" curl --silent -f -X GET http:\/\/$HOST:$PORT0\/ops\/ping > 
> \/dev\/null 
> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":90.0,"interval_seconds":2.0,"timeout_seconds":5.0}
>  --task_id=prod_talkk_push-green.601337e6-e219-11e5-a617-02429957d388
>          0.32 Mb grep --color=auto health_check
> {noformat}
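The *ps ... --sort -size* pipeline above can be condensed; a minimal equivalent that just totals the reported size of every checker process, assuming procps *ps* (it prints 0.00 MB when no such process is running):

```shell
# Total the "size" column of every mesos-health-check process, in MB
# (prints 0.00 MB when no such process exists). Assumes procps ps; the
# pipeline's exit status is awk's, so a missing process is not an error.
ps -C mesos-health-check -o size= |
  awk '{ total += $1 } END { printf "%.2f MB\n", total / 1024 }'
```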
> Killing the *mesos-health-check* process for each container fixes our memory 
> issues (though I assume health checks will no longer be reported):
> {noformat}
> root@mesos-slave-i-e3a9c724 # date ; free -m ; ps awwx | grep health_check | 
> grep -v grep | awk '{print $1}' | xargs -I% -P1 kill % ; date ; free -m
> Fri Mar  4 21:20:55 UTC 2016
>              total       used       free     shared    buffers     cached
> Mem:         16047      13538       2508          0       1140        774
> -/+ buffers/cache:      11623       4423
> Swap:            0          0          0
> Fri Mar  4 21:20:56 UTC 2016
>              total       used       free     shared    buffers     cached
> Mem:         16047       9101       6945          0       1140        774
> -/+ buffers/cache:       7186       8860
> Swap:            0          0          0
> {noformat}
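As a side note, the ps | grep | awk | xargs pipeline above can be written more compactly with *pkill* from procps, which matches against the full command line; the effect should be the same:

```shell
# Equivalent of the ps | grep | awk | xargs pipeline above: signal every
# process whose full command line (-f) matches. pkill exits 1 when nothing
# matched, so tolerate that when running speculatively.
pkill -f mesos-health-check || true
```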
> As the *free -m* output shows, killing the checkers released roughly 4.4 GB 
> of used memory (13538 MB down to 9101 MB) within a second. We're reverting to 
> Marathon-driven health checks for now, but want to emphasize that this is 
> happening across all our slaves, not an isolated issue.
> Thanks for looking into it :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
