[
https://issues.apache.org/jira/browse/MESOS-4740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Cong Wang updated MESOS-4740:
-----------------------------
Description:
David Robinson noticed that retrieving metrics/snapshot statistics can be very
inefficient and can cause the Mesos master to get stuck.
{noformat}
[root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot
real 2m7.302s
user 0m0.001s
sys 0m0.004s
{noformat}
MESOS-1287 introduced a timeout parameter for this query, but observers like
ours are not aware of such a URL-specific parameter, so we need to:
1) Always apply a timeout, with a default value when none is given.
2) Investigate why metrics/snapshot can take such a long time to complete
under load, since we don't use history for these statistics and the values are
just atomic reads.
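To illustrate item 1, here is a minimal sketch of a snapshot that always
applies a timeout and falls back to a default when the caller gives none. It
is not the Mesos/libprocess implementation: it uses standard-library futures
as a stand-in, and the names snapshot and kDefaultTimeout, as well as the
1 second default, are assumptions made for illustration only.

```cpp
#include <chrono>
#include <future>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical default; the issue only asks that some default exist.
constexpr std::chrono::milliseconds kDefaultTimeout{1000};

using Metric = std::pair<std::string, std::shared_future<double>>;

// Collects the metric values that become ready before the deadline and
// omits the rest, so a single slow metric cannot stall the whole response.
std::map<std::string, double> snapshot(
    const std::vector<Metric>& metrics,
    std::chrono::milliseconds timeout = kDefaultTimeout)
{
  // One deadline shared by all metrics, so the total wait is bounded.
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  std::map<std::string, double> result;
  for (const Metric& metric : metrics) {
    if (metric.second.wait_until(deadline) == std::future_status::ready) {
      result[metric.first] = metric.second.get();
    }
  }
  return result;
}
```

With this shape, an observer that passes no timeout still gets a response
within the default bound, at the cost of missing any metric that was not
ready in time.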
was:
David Robinson noticed that retrieving metrics/snapshot statistics can be very
inefficient and can cause the Mesos master to get stuck.
{noformat}
[root@atla-bny-34-sr1 ~]# time curl -s localhost:5051/metrics/snapshot
real 2m7.302s
user 0m0.001s
sys 0m0.004s
{noformat}
From a quick glance at the code, this *seems* to be because we sort all the
values saved in the time series when calculating percentiles.
{noformat}
foreach (const typename TimeSeries<T>::Value& value, values_) {
  values.push_back(value.data);
}
std::sort(values.begin(), values.end());
{noformat}
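If the full sort did turn out to matter, a single percentile can be selected
without ordering the whole vector. The sketch below is not Mesos code: the
percentile function and its rank rule (index floor(p * (n - 1)) of the sorted
order, no interpolation) are assumptions made for illustration. It uses
std::nth_element, which is linear on average rather than O(n log n).

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns the element that would sit at index floor(p * (n - 1)) if the
// values were sorted. The vector is taken by value because nth_element
// reorders its input.
double percentile(std::vector<double> values, double p)
{
  if (values.empty() || p < 0.0 || p > 1.0) {
    throw std::invalid_argument("need non-empty values and 0 <= p <= 1");
  }

  const std::size_t index =
      static_cast<std::size_t>(p * static_cast<double>(values.size() - 1));

  // Partially orders the vector so that values[index] holds the element
  // a full sort would place there.
  std::nth_element(values.begin(), values.begin() + index, values.end());
  return values[index];
}
```

When several percentiles are needed from the same series, one std::sort may
still be the simpler choice, since each nth_element call only selects a single
rank.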
> Improve metrics/snapshot performance
> ------------------------------------
>
> Key: MESOS-4740
> URL: https://issues.apache.org/jira/browse/MESOS-4740
> Project: Mesos
> Issue Type: Task
> Reporter: Cong Wang
> Assignee: Cong Wang
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)