[ 
https://issues.apache.org/jira/browse/MESOS-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426764#comment-16426764
 ] 

sri krishna commented on MESOS-8731:
------------------------------------

Following up on the discussion with [~bmahler] in the community Slack channel:

We have a lot of open UI sessions (developers checking the Mesos UI for 
container logs). The UI polls the state endpoint, which is currently rather 
expensive. The polling interval is defined at 
[https://github.com/apache/mesos/blob/124c677c86c7b12ca4568f004895b8ca30d60dcf/src/webui/master/static/js/controllers.js#L91-L112]

For now I have increased the polling interval in the Mesos "controllers.js" to 
a large value, and the latencies have come down.
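
To illustrate the idea of polling less aggressively, here is a minimal sketch of a delay calculation that backs off when the state endpoint is slow. It is not the code in "controllers.js"; the function name, parameters, and default values are all hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper, NOT taken from the Mesos web UI.
// Computes how long to wait (in ms) before polling the state endpoint again.
// The wait grows with the latency of the previous response, so a slow
// master is polled less often, but it never drops below baseDelayMs or
// exceeds maxDelayMs.
function nextPollDelayMs(lastLatencyMs, baseDelayMs = 10000,
                         maxDelayMs = 120000, factor = 10) {
  const scaled = lastLatencyMs * factor;
  return Math.min(maxDelayMs, Math.max(baseDelayMs, scaled));
}
```

With these assumed defaults, a fast response keeps the poll at the base interval, while a response that takes several seconds pushes the next poll out accordingly, up to the cap.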

> mesos master APIs become latent
> -------------------------------
>
>                 Key: MESOS-8731
>                 URL: https://issues.apache.org/jira/browse/MESOS-8731
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.4.0, 1.5.0
>            Reporter: sri krishna
>            Priority: Critical
>
> Over a period of time, one of the UI API calls to the master becomes latent. 
> A request that normally takes less than a second takes up to 20 seconds 
> during peak. Many people on the dev team access the UI for logs.
> Below are my observations:
> In mesos "0.28.1-2.0.20.ubuntu1404"
> ################################################################
> # ab -n 1000 -c 10 "http://mesos-master1.mesos.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g"
> This is ApacheBench, Version 2.3 <$Revision: 1528965 $>
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Licensed to The Apache Software Foundation, http://www.apache.org/
> Benchmarking mesos-master1.mesos.bla.net (be patient)
> Completed 100 requests
> Completed 200 requests
> Completed 300 requests
> Completed 400 requests
> Completed 500 requests
> Completed 600 requests
> Completed 700 requests
> Completed 800 requests
> Completed 900 requests
> Completed 1000 requests
> Finished 1000 requests
> Server Software:
> Server Hostname: mesos-master1.mesos.bla.net
> Server Port: 5050
> Document Path: /metrics/snapshot?jsonp=angular.callbacks._4g
> Document Length: 3197 bytes
> Concurrency Level: 10
> Time taken for tests: 501.010 seconds
> Complete requests: 1000
> Failed requests: 954
>  (Connect: 0, Receive: 0, Length: 954, Exceptions: 0)
> Total transferred: 3304510 bytes
> HTML transferred: 3195510 bytes
> Requests per second: 2.00 [#/sec] (mean)
> Time per request: 5010.104 [ms] (mean)
> Time per request: 501.010 [ms] (mean, across all concurrent requests)
> Transfer rate: 6.44 [Kbytes/sec] received
> Connection Times (ms)
>  min mean[+/-sd] median max
> Connect: 0 0 0.0 0 0
> Processing: 321 4987 286.4 5007 5508
> Waiting: 321 4987 286.4 5007 5508
> Total: 321 4988 286.4 5007 5508
> Percentage of the requests served within a certain time (ms)
>  50% 5007
>  66% 5007
>  75% 5008
>  80% 5008
>  90% 5008
>  95% 5009
>  98% 5010
>  99% 5506
>  100% 5508 (longest request)
> ################################################################
>  
> In Mesos 1.4 and 1.5 (versions 1.4.0-2.0.1 and 1.5.0-2.0.1) the response 
> time of these APIs is quite high. 
> ################################################################
> # ab -n 1000 -c 10 "http://mesos-master3.stage.bla.net:5050/metrics/snapshot?jsonp=angular.callbacks._4g"
> This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
> Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
> Licensed to The Apache Software Foundation, http://www.apache.org/
> Benchmarking mesos-master3.stage.bla.net (be patient)
> Completed 100 requests
> Completed 200 requests
> Completed 300 requests
> Completed 400 requests
> Completed 500 requests
> ^C
> Server Software:
> Server Hostname: mesos-master3.stage.bla.net
> Server Port: 5050
> Document Path: /metrics/snapshot?jsonp=angular.callbacks._4g
> Document Length: 6596 bytes
> Concurrency Level: 10
> Time taken for tests: 1405.182 seconds
> Complete requests: 582
> Failed requests: 580
>  (Connect: 0, Receive: 0, Length: 580, Exceptions: 0)
> Total transferred: 3909986 bytes
> HTML transferred: 3846548 bytes
> Requests per second: 0.41 [#/sec] (mean)
> Time per request: 24144.024 [ms] (mean)
> Time per request: 2414.402 [ms] (mean, across all concurrent requests)
> Transfer rate: 2.72 [Kbytes/sec] received
> Connection Times (ms)
>  min mean[+/-sd] median max
> Connect: 0 0 0.0 0 0
> Processing: 15284 24058 2600.7 23937 31740
> Waiting: 15284 24058 2600.7 23937 31740
> Total: 15284 24059 2600.7 23938 31740
> Percentage of the requests served within a certain time (ms)
>  50% 23938
>  66% 25074
>  75% 25729
>  80% 26465
>  90% 27605
>  95% 28215
>  98% 29685
>  99% 30595
>  100% 31740 (longest request)
> ################################################################
> I think this is causing other APIs like "/master/slaves" and "/metrics" 
> to become latent. 
> At this point we are forcing a re-election of the master to bring the times 
> down. What can I do to bring these times down? The load on the box is quite 
> low. The load average does not exceed 2 on an 8-core box.
> Let me know if any further info is required. 
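
As a sanity check on the ab figures quoted above, the mean "Time per request" that ab reports can be reproduced from the concurrency level, the total time taken, and the number of completed requests. The sketch below is only that arithmetic; the function name is made up for illustration.

```javascript
// Hypothetical helper that reproduces ab's mean "Time per request" figure:
//   concurrency * elapsed time / completed requests, expressed in ms.
function meanTimePerRequestMs(concurrency, elapsedSeconds, completed) {
  return (concurrency * elapsedSeconds * 1000) / completed;
}

// Inputs copied from the two ab runs quoted above.
const oldMaster = meanTimePerRequestMs(10, 501.010, 1000);  // 0.28.1 run
const newMaster = meanTimePerRequestMs(10, 1405.182, 582);  // 1.4/1.5 run

console.log(oldMaster.toFixed(1), newMaster.toFixed(1));
```

The results come out at roughly 5010 ms and 24144 ms, matching the reported means, so the newer master is nearly five times slower per request at the same concurrency.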



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
