Re: [ceph-users] What are you doing to locate performance issues in a Ceph cluster?

Dan Ryder (daryder) Wed, 08 Apr 2015 07:25:58 -0700

Yes, the unit is in seconds for those latencies. The sum/avgcount is the 
average since the daemon was (re)started.

If you're interested, I've co-authored a collectd plugin which captures data 
from Ceph daemons - built into the plugin I give the option to calculate either 
the long-run avg (sum/avgcount) or the last-poll delta 
((sum_now - sum_last_poll) / (avgcount_now - avgcount_last_poll)). It's been 
added to the latest collectd branch (https://github.com/collectd/collectd).
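
For illustration, the two calculations could be sketched roughly as below. This 
is not the plugin's code, just a minimal Python sketch: the function names are 
made up, the first sample is the op_r_latency pair from the dump quoted further 
down, and the second sample is a hypothetical later poll.

```python
def long_run_avg(sample):
    """Mean latency in seconds since the daemon was (re)started."""
    if sample["avgcount"] == 0:
        return None  # nothing recorded yet
    return sample["sum"] / sample["avgcount"]


def last_poll_avg(prev, curr):
    """Mean latency in seconds over the interval between two polls."""
    d_sum = curr["sum"] - prev["sum"]
    d_count = curr["avgcount"] - prev["avgcount"]
    if d_count <= 0:
        # No new operations, or the counters went backwards
        # (for example because the daemon was restarted).
        return None
    return d_sum / d_count


# op_r_latency as quoted in this thread.
prev = {"avgcount": 471, "sum": 1.329795}
# Hypothetical next poll: 29 more reads taking 0.29 s in total.
curr = {"avgcount": 500, "sum": 1.619795}

print("long-run avg (s): ", long_run_avg(curr))
print("last-poll avg (s):", last_poll_avg(prev, curr))
```

With these made-up numbers the long-run average stays around 3 ms while the 
last-poll average is about 10 ms, which is why the delta is usually the more 
useful one for spotting a recent slowdown.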

Dan Ryder

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Francois Lafont
Sent: Wednesday, April 08, 2015 10:11 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What are you doing to locate performance issues in a 
Ceph cluster?

Chris Kitzmiller wrote:

>> ~# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok perf
>>
>>  [...]
>>
>>  "osd": { "opq": 0,
>>      "op_wip": 0,
>>      "op": 3566,
>>      "op_in_bytes": 208803635,
>>      "op_out_bytes": 146962506,
>>      "op_latency": { "avgcount": 3566,
>>          "sum": 100.330695000},
>>      "op_process_latency": { "avgcount": 3566,
>>          "sum": 84.702772000},
>>      "op_r": 471,
>>      "op_r_out_bytes": 146851024,
>>      "op_r_latency": { "avgcount": 471,
>>          "sum": 1.329795000},
>>
>>   [...]
>>
>> Is the value of "op_r_latency" the read latency (i.e. 1.329 ms above)?
>> In that case, I don't understand the meaning of "avgcount"
>> and "sum".
>>
>> "sum" is the sum of what?
>> "avgcount" is the average of what?
> 
> There are a bunch of these avgcount/sum pairs and, from what I've gleaned, 
> you're to simply divide sum by avgcount to get the mean of that particular 
> stat over whatever time frame it is measuring.

Err..., I'm sorry, I'm not sure I understand correctly. If I take the values of 
op_r_latency above, I have:

    sum/avgcount = 1.329795000/471 = 0.002823344

Would 0.002823344 ms be the latency of my read operations?
That seems impossible to me (unfortunately ;)), or maybe the unit is seconds?
In that case 2.823344 ms would be a plausible value. In any case, I don't 
understand the name "avgcount". The name "count" seems more logical to me (but 
maybe I haven't really understood its meaning).

Looking at the source code in ./src/common/perf_counters.cc, it seems to me 
that the number is indeed in seconds, but I'm (really) not a C++ expert.
Could someone confirm that for me?
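
Assuming the sums are in seconds, the per-counter means from the excerpt above 
can be worked out in milliseconds with a few lines of Python. This is only a 
sketch: the helper name is made up, and the JSON is a hand-trimmed copy of the 
quoted output rather than a full dump.

```python
import json

# Hand-trimmed copy of the "osd" section quoted above.
excerpt = """
{"osd": {"op": 3566,
         "op_latency": {"avgcount": 3566, "sum": 100.330695000},
         "op_process_latency": {"avgcount": 3566, "sum": 84.702772000},
         "op_r": 471,
         "op_r_latency": {"avgcount": 471, "sum": 1.329795000}}}
"""


def mean_latencies_ms(section):
    """Return {counter: mean in ms} for every avgcount/sum pair."""
    result = {}
    for name, value in section.items():
        # Plain counters such as "op" are integers and are skipped.
        if isinstance(value, dict) and {"avgcount", "sum"} <= value.keys():
            if value["avgcount"] > 0:
                result[name] = 1000.0 * value["sum"] / value["avgcount"]
    return result


means = mean_latencies_ms(json.loads(excerpt)["osd"])
for name, ms in sorted(means.items()):
    print(f"{name}: {ms:.3f} ms")
```

For op_r_latency this gives roughly 2.823 ms, the same figure as the manual 
division above.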

Another thing: if I understand correctly, sum/avgcount is an average latency 
computed since the start of the OSD daemon. So, after a long time, the average 
will become more stable and will no longer show much variation.
Is it possible to reset the counters? I noticed that restarting the daemon 
works, but that's a little drastic.

--
François Lafont
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
