[ 
https://issues.apache.org/jira/browse/MESOS-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cong Wang updated MESOS-4740:
-----------------------------
    Description: 
David Robinson noticed that retrieving metrics/snapshot statistics can be very 
inefficient and can cause the Mesos master to get stuck.

{noformat}
[root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot

real    2m7.302s
user    0m0.001s
sys    0m0.004s
{noformat}

MESOS-1287 introduced a timeout parameter for this query, but observers like 
ours are not aware of such a URL-specific parameter, so we need to:

1) Always apply a timeout, with a default value when none is given.

2) Investigate why metrics/snapshot can take so long to complete under load, 
since we don't use history for these statistics and each value is just an 
atomic read.
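
As a rough illustration of item 1, the sketch below shows one way a snapshot could always be bounded by a timeout. It is not the Mesos implementation: `kDefaultTimeout`, `effectiveTimeout`, and `snapshot` are hypothetical names, the 1-second default is an arbitrary placeholder, and metrics are modelled as plain `std::shared_future<double>` values rather than libprocess futures. Leaving out metrics that are still pending at the deadline is likewise an assumption of this sketch.

```cpp
#include <chrono>
#include <future>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Millis = std::chrono::milliseconds;

// Hypothetical default; the issue does not propose a specific value.
constexpr Millis kDefaultTimeout{1000};

// Use the caller's timeout when one was supplied, otherwise fall back to
// the default, so that a snapshot request is never unbounded.
inline Millis effectiveTimeout(const std::optional<Millis>& requested) {
  return requested.value_or(kDefaultTimeout);
}

// Collect every metric whose value is available before the deadline.
// Metrics that are still pending at that point are left out of the result.
inline std::map<std::string, double> snapshot(
    const std::vector<std::pair<std::string, std::shared_future<double>>>&
        metrics,
    const std::optional<Millis>& requested = std::nullopt) {
  const auto deadline =
      std::chrono::steady_clock::now() + effectiveTimeout(requested);

  std::map<std::string, double> result;
  for (const auto& [name, value] : metrics) {
    if (value.wait_until(deadline) == std::future_status::ready) {
      result[name] = value.get();
    }
  }
  return result;
}
```

With this shape, an observer that knows nothing about the timeout parameter would call `snapshot(metrics)` and still get a response once the default deadline passes, while a caller that does pass a timeout keeps control over how long it waits.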


  was:
David Robinson noticed that retrieving metrics/snapshot statistics can be very 
inefficient and can cause the Mesos master to get stuck.

{noformat}
[root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot

real    2m7.302s
user    0m0.001s
sys    0m0.004s
{noformat}

From a quick glance at the code, this *seems* to be because we sort all the 
values saved in the time series when calculating percentiles.

{noformat}
    foreach (const typename TimeSeries<T>::Value& value, values_) {
      values.push_back(value.data);
    }

    std::sort(values.begin(), values.end());
{noformat}



> Improve metrics/snapshot performance
> ------------------------------------
>
>                 Key: MESOS-4740
>                 URL: https://issues.apache.org/jira/browse/MESOS-4740
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Cong Wang
>            Assignee: Cong Wang
>
> David Robinson noticed that retrieving metrics/snapshot statistics can be very 
> inefficient and can cause the Mesos master to get stuck.
> {noformat}
> [root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot
> real    2m7.302s
> user    0m0.001s
> sys    0m0.004s
> {noformat}
> MESOS-1287 introduced a timeout parameter for this query, but observers like 
> ours are not aware of such a URL-specific parameter, so we need to:
> 1) Always apply a timeout, with a default value when none is given.
> 2) Investigate why metrics/snapshot can take so long to complete under load, 
> since we don't use history for these statistics and each value is just an 
> atomic read.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
