[
https://issues.apache.org/jira/browse/MESOS-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427880#comment-16427880
]
Benjamin Mahler commented on MESOS-8731:
----------------------------------------
Linking in MESOS-8345, since this appears to be a case of the master being
overwhelmed by state polling from many webui instances.
> mesos master APIs become latent
> -------------------------------
>
> Key: MESOS-8731
> URL: https://issues.apache.org/jira/browse/MESOS-8731
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.4.0, 1.5.0
> Reporter: sri krishna
> Priority: Critical
>
> Over time, one of the UI API calls to the master becomes slow. A request
> that normally takes less than a second takes up to 20 seconds during peak
> usage. Many members of the dev team access the UI for logs.
> Below are my observations:
> In Mesos "0.28.1-2.0.20.ubuntu1404":
> ################################################################
> # ab -n 1000 -c 10
> "http://mesos-master1.mesos.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g"
> This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Licensed to The Apache Software Foundation, http://www.apache.org/
> Benchmarking mesos-master1.mesos.bla.net (be patient)
> Completed 100 requests
> Completed 200 requests
> Completed 300 requests
> Completed 400 requests
> Completed 500 requests
> Completed 600 requests
> Completed 700 requests
> Completed 800 requests
> Completed 900 requests
> Completed 1000 requests
> Finished 1000 requests
> Server Software:
> Server Hostname: mesos-master1.mesos.bla.net
> Server Port: 5050
> Document Path: /metrics/snapshot?jsonp=angular.callbacks._4g
> Document Length: 3197 bytes
> Concurrency Level: 10
> Time taken for tests: 501.010 seconds
> Complete requests: 1000
> Failed requests: 954
> (Connect: 0, Receive: 0, Length: 954, Exceptions: 0)
> Total transferred: 3304510 bytes
> HTML transferred: 3195510 bytes
> Requests per second: 2.00 [#/sec] (mean)
> Time per request: 5010.104 [ms] (mean)
> Time per request: 501.010 [ms] (mean, across all concurrent requests)
> Transfer rate: 6.44 [Kbytes/sec] received
> Connection Times (ms)
> min mean[+/-sd] median max
> Connect: 0 0 0.0 0 0
> Processing: 321 4987 286.4 5007 5508
> Waiting: 321 4987 286.4 5007 5508
> Total: 321 4988 286.4 5007 5508
> Percentage of the requests served within a certain time (ms)
> 50% 5007
> 66% 5007
> 75% 5008
> 80% 5008
> 90% 5008
> 95% 5009
> 98% 5010
> 99% 5506
> 100% 5508 (longest request)
> ################################################################
>
> In Mesos 1.4 and 1.5 (versions 1.4.0-2.0.1 and 1.5.0-2.0.1) the response
> time of this API is much higher:
> ################################################################
> # ab -n 1000 -c 10
> "http://mesos-master3.stage.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g"
> This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Licensed to The Apache Software Foundation, http://www.apache.org/
> Benchmarking mesos-master3.stage.bla.net (be patient)
> Completed 100 requests
> Completed 200 requests
> Completed 300 requests
> Completed 400 requests
> Completed 500 requests
> ^C
> Server Software:
> Server Hostname: mesos-master3.stage.bla.net
> Server Port: 5050
> Document Path: /metrics/snapshot?jsonp=angular.callbacks._4g
> Document Length: 6596 bytes
> Concurrency Level: 10
> Time taken for tests: 1405.182 seconds
> Complete requests: 582
> Failed requests: 580
> (Connect: 0, Receive: 0, Length: 580, Exceptions: 0)
> Total transferred: 3909986 bytes
> HTML transferred: 3846548 bytes
> Requests per second: 0.41 [#/sec] (mean)
> Time per request: 24144.024 [ms] (mean)
> Time per request: 2414.402 [ms] (mean, across all concurrent requests)
> Transfer rate: 2.72 [Kbytes/sec] received
> Connection Times (ms)
> min mean[+/-sd] median max
> Connect: 0 0 0.0 0 0
> Processing: 15284 24058 2600.7 23937 31740
> Waiting: 15284 24058 2600.7 23937 31740
> Total: 15284 24059 2600.7 23938 31740
> Percentage of the requests served within a certain time (ms)
> 50% 23938
> 66% 25074
> 75% 25729
> 80% 26465
> 90% 27605
> 95% 28215
> 98% 29685
> 99% 30595
> 100% 31740 (longest request)
> ################################################################
> I think this is causing other APIs, such as "/master/slaves/" and "/metrics",
> to become slow as well.
> At this point we are forcing a re-election of the master to bring the
> response times down. What can I do to bring these times down? The load on the
> box is quite low: the load average does not exceed 2 on an 8-core box.
> Let me know if any further info is required.
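
To make the two ApacheBench runs quoted above easier to compare, here is a minimal sketch that recomputes the summary figures from the raw numbers in the report (elapsed time, completed requests, concurrency). The function name ab_summary is hypothetical, and the formulas are an assumption about how ab derives these values; they do reproduce the reported figures to within rounding.

```python
def ab_summary(elapsed_s, completed, concurrency):
    """Recompute ab-style summary figures from raw run totals.

    Hypothetical helper; the formulas are assumed, not taken from ab's source.
    """
    # Throughput: completed requests divided by wall-clock time.
    requests_per_second = completed / elapsed_s
    # Mean latency seen by one client: wall-clock time per completed
    # request, scaled by the number of requests in flight.
    time_per_request_ms = concurrency * elapsed_s / completed * 1000
    # Mean wall-clock time per request across all concurrent clients.
    time_per_request_all_ms = elapsed_s / completed * 1000
    return requests_per_second, time_per_request_ms, time_per_request_all_ms


# Raw figures copied from the two runs in the report.
run_0_28 = ab_summary(elapsed_s=501.010, completed=1000, concurrency=10)
run_1_x = ab_summary(elapsed_s=1405.182, completed=582, concurrency=10)

# Ratio of mean per-request latency between the two runs.
slowdown = run_1_x[1] / run_0_28[1]

if __name__ == "__main__":
    for label, run in (("0.28.1", run_0_28), ("1.4/1.5", run_1_x)):
        print(f"{label}: {run[0]:.2f} req/s, {run[1]:.1f} ms per request")
    print(f"slowdown: {slowdown:.1f}x")
```

On these numbers the mean per-request latency on the 1.4/1.5 master is roughly 4.8 times that of the 0.28.1 master at the same concurrency level of 10.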