Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-19 Thread Brad Hubbard
On Sat, May 19, 2018 at 5:01 PM, Uwe Sauter  wrote:
> The mystery is that these blocked requests occur in large numbers when at
> least one of the 6 servers is booted with kernel 4.15.17; if all are
> running 4.13.16, blocked requests are infrequent and few.


 Sounds like you need to profile your two kernel versions and work out
 why one is under-performing.

>>>
>>> Well, the problem is that I see this behavior only in our production
>>> system (6 hosts and 22 OSDs total). The test system I have is
>>> a bit smaller (only 3 hosts with 12 OSDs on older hardware) and shows no
>>> sign of this possible regression…
>>
>>
>> Are you saying you can't gather performance data from your production
>> system?
>
>
> As far as I can tell the issue only occurs on the production cluster.
> Without a way to reproduce it on the test cluster I can't bisect the
> kernels, because the production cluster runs our central infrastructure,
> and each time the active LDAP server hangs, most of the other services
> hang as well…
> My colleagues won't appreciate that.
>
> What other kind of performance data would you have collected?
>

On systems where this can be reproduced I would use tools like 'perf
top', pcp (Performance Co-Pilot), and collectd, and maybe something like
the following to capture data that can be analysed to characterise the
issue.

# The following is for RHEL 6/7, so it may need modification:
{ top -n 5 -b > /tmp/top.out; \
  vmstat 1 50 > /tmp/vm.out; \
  iostat -tkx -p ALL 1 10 > /tmp/io.out; \
  mpstat -A 1 10 > /tmp/mp.out; \
  ps auwwx > /tmp/ps1.out; \
  ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan > /tmp/ps2.out; \
  sar -A 1 50 > /tmp/sar.out; \
  free > /tmp/free.out; }; \
tar -cjvf outputs_$(hostname)_$(date +"%d-%b-%Y_%H%M").tar.bz2 /tmp/*.out

As you've already pointed out, this currently seems to be a kernel
performance issue, but analysis of this sort of data should help you
narrow it down.

Of course, all of this relies on you being able to reproduce the
issue, but maybe you can gather a baseline to begin with so you have
something to compare to when you are in a position to gather perf data
during an issue.

At the same time I'd suggest pursuing this with Proxmox and/or Ubuntu
to see if they have anything to offer.

-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-19 Thread Uwe Sauter



Am 19.05.2018 um 01:45 schrieb Brad Hubbard:

On Thu, May 17, 2018 at 6:06 PM, Uwe Sauter  wrote:

Brad,

thanks for the bug report. This is exactly the problem I am having (log-wise).


You don't give any indication what version you are running but see
https://tracker.ceph.com/issues/23205



the cluster is a Proxmox installation which is based on an Ubuntu kernel.

# ceph -v
ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous
(stable)

The mystery is that these blocked requests occur in large numbers when at least
one of the 6 servers is booted with kernel 4.15.17; if all are running
4.13.16, blocked requests are infrequent and few.


Sounds like you need to profile your two kernel versions and work out
why one is under-performing.



Well, the problem is that I see this behavior only in our production system (6 
hosts and 22 OSDs total). The test system I have is
a bit smaller (only 3 hosts with 12 OSDs on older hardware) and shows no sign 
of this possible regression…


Are you saying you can't gather performance data from your production system?


As far as I can tell the issue only occurs on the production cluster. Without a
way to reproduce it on the test cluster I can't bisect the kernels, because the
production cluster runs our central infrastructure, and each time the active
LDAP server hangs, most of the other services hang as well…
My colleagues won't appreciate that.

What other kind of performance data would you have collected?

Uwe






Regards,

 Uwe







Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-18 Thread Brad Hubbard
On Thu, May 17, 2018 at 6:06 PM, Uwe Sauter  wrote:
> Brad,
>
> thanks for the bug report. This is exactly the problem I am having (log-wise).

 You don't give any indication what version you are running but see
 https://tracker.ceph.com/issues/23205
>>>
>>>
>>> the cluster is a Proxmox installation which is based on an Ubuntu kernel.
>>>
>>> # ceph -v
>>> ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous
>>> (stable)
>>>
>>> The mystery is that these blocked requests occur in large numbers when at
>>> least one of the 6 servers is booted with kernel 4.15.17; if all are
>>> running 4.13.16, blocked requests are infrequent and few.
>>
>> Sounds like you need to profile your two kernel versions and work out
>> why one is under-performing.
>>
>
> Well, the problem is that I see this behavior only in our production system 
> (6 hosts and 22 OSDs total). The test system I have is
> a bit smaller (only 3 hosts with 12 OSDs on older hardware) and shows no sign 
> of this possible regression…

Are you saying you can't gather performance data from your production system?

>
>
> Regards,
>
> Uwe



-- 
Cheers,
Brad


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-17 Thread Uwe Sauter
Brad,

thanks for the bug report. This is exactly the problem I am having (log-wise).
>>>
>>> You don't give any indication what version you are running but see
>>> https://tracker.ceph.com/issues/23205
>>
>>
>> the cluster is a Proxmox installation which is based on an Ubuntu kernel.
>>
>> # ceph -v
>> ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous
>> (stable)
>>
>> The mystery is that these blocked requests occur in large numbers when at
>> least one of the 6 servers is booted with kernel 4.15.17; if all are
>> running 4.13.16, blocked requests are infrequent and few.
> 
> Sounds like you need to profile your two kernel versions and work out
> why one is under-performing.
> 

Well, the problem is that I see this behavior only in our production system (6 
hosts and 22 OSDs total). The test system I have is
a bit smaller (only 3 hosts with 12 OSDs on older hardware) and shows no sign 
of this possible regression…


Regards,

Uwe


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-17 Thread Brad Hubbard
On Thu, May 17, 2018 at 4:16 PM, Uwe Sauter  wrote:
> Hi,
>
>>> I'm currently chewing on an issue regarding "slow requests are blocked".
>>> I'd like to identify the OSD that is causing those events
>>> once the cluster is back to HEALTH_OK (as I have no monitoring yet that
>>> would get this info in realtime).
>>>
>>> Collecting this information could help identify aging disks if you were
>>> able to accumulate and analyze which OSD had blocking
>>> requests in the past and how often those events occur.
>>>
>>> My research so far leads me to think that this information is only available
>>> as long as the requests are actually blocked. Is this
>>> correct?
>>
>>
>> You don't give any indication what version you are running but see
>> https://tracker.ceph.com/issues/23205
>
>
> the cluster is a Proxmox installation which is based on an Ubuntu kernel.
>
> # ceph -v
> ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous
> (stable)
>
> The mystery is that these blocked requests occur in large numbers when at
> least one of the 6 servers is booted with kernel 4.15.17; if all are
> running 4.13.16, blocked requests are infrequent and few.

Sounds like you need to profile your two kernel versions and work out
why one is under-performing.

>
>
> Regards,
>
> Uwe



-- 
Cheers,
Brad


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-17 Thread Uwe Sauter

Hi,


I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
like to identify the OSD that is causing those events
once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
get this info in realtime).

Collecting this information could help identify aging disks if you were able to 
accumulate and analyze which OSD had blocking
requests in the past and how often those events occur.

My research so far leads me to think that this information is only available as 
long as the requests are actually blocked. Is this
correct?


You don't give any indication what version you are running but see
https://tracker.ceph.com/issues/23205


the cluster is a Proxmox installation which is based on an Ubuntu kernel.

# ceph -v
ceph version 12.2.5 (dfcb7b53b2e4fcd2a5af0240d4975adc711ab96e) luminous (stable)

The mystery is that these blocked requests occur in large numbers when at least one of the 6 servers is booted with kernel 
4.15.17; if all are running 4.13.16, blocked requests are infrequent and few.



Regards,

Uwe


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Brad Hubbard
On Wed, May 16, 2018 at 6:16 PM, Uwe Sauter  wrote:
> Hi folks,
>
> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
> like to identify the OSD that is causing those events
> once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
> get this info in realtime).
>
> Collecting this information could help identify aging disks if you were able 
> to accumulate and analyze which OSD had blocking
> requests in the past and how often those events occur.
>
> My research so far leads me to think that this information is only available as 
> long as the requests are actually blocked. Is this
> correct?

You don't give any indication what version you are running but see
https://tracker.ceph.com/issues/23205

>
> MON logs only show that those events occur and how many requests are in a
> blocked state, but give no indication of which OSD is affected. Is there a
> way to identify blocking requests from the OSD log files?
>
>
> On a side note: I was trying to write a small Python script that would
> extract this kind of information in realtime. While I was able to register
> a MonitorLog callback that receives the same messages as you would get
> with "ceph -w", I haven't seen in the librados Python bindings
> documentation a way to do the equivalent of "ceph health detail". Any
> suggestions on how to get the blocking OSDs via librados?
>
>
> Thanks,
>
> Uwe



-- 
Cheers,
Brad


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Mohamad Gebai

On 05/16/2018 07:18 AM, Uwe Sauter wrote:
> Hi Mohamad,
>
>>
>> I think this is what you're looking for:
>>
>> $> ceph daemon osd.X dump_historic_slow_ops
>>
>> which gives you recent slow operations, as opposed to
>>
>> $> ceph daemon osd.X dump_blocked_ops
>>
>> which returns current blocked operations. You can also add a filter to
>> those commands.
> Thanks for these commands, I'll have a look into them. If I understand
> correctly, I need to run them on each server for each OSD rather than from
> a central location, is that correct?
>

That's the case, as it uses the admin socket.
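For example, something along these lines could sweep all the OSD admin
sockets on a host at once (a sketch only: it assumes the default
/var/run/ceph socket paths, the default cluster name, and the ceph CLI in
PATH; adjust for your layout):

```python
import glob
import json
import os
import subprocess

def osd_sockets(run_dir="/var/run/ceph"):
    """List (osd_name, socket_path) pairs for the OSD admin sockets found
    in run_dir (default path and 'ceph-osd.N.asok' naming assumed)."""
    pairs = []
    for sock in sorted(glob.glob(os.path.join(run_dir, "ceph-osd.*.asok"))):
        # "ceph-osd.3.asok" -> "osd.3"
        name = os.path.basename(sock)[len("ceph-"):-len(".asok")]
        pairs.append((name, sock))
    return pairs

def dump_slow_ops(run_dir="/var/run/ceph"):
    """Ask each local OSD for its recent slow ops via the admin socket."""
    results = {}
    for name, sock in osd_sockets(run_dir):
        out = subprocess.check_output(
            ["ceph", "daemon", sock, "dump_historic_slow_ops"])
        results[name] = json.loads(out)
    return results
```

You would still have to run this on every OSD host (e.g. via ssh or your
config management) and aggregate the results centrally.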

Mohamad



Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Uwe Sauter
Hi Mohamad,


>> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
>> like to identify the OSD that is causing those events
>> once the cluster is back to HEALTH_OK (as I have no monitoring yet that 
>> would get this info in realtime).
>>
>> Collecting this information could help identify aging disks if you were able 
>> to accumulate and analyze which OSD had blocking
>> requests in the past and how often those events occur.
>>
>> My research so far leads me to think that this information is only available as 
>> long as the requests are actually blocked. Is this
>> correct?
> 
> I think this is what you're looking for:
> 
> $> ceph daemon osd.X dump_historic_slow_ops
> 
> which gives you recent slow operations, as opposed to
> 
> $> ceph daemon osd.X dump_blocked_ops
> 
> which returns current blocked operations. You can also add a filter to
> those commands.

Thanks for these commands, I'll have a look into them. If I understand
correctly, I need to run them on each server for each OSD rather than from a
central location, is that correct?

Regards,

Uwe


Re: [ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Mohamad Gebai
Hi,

On 05/16/2018 04:16 AM, Uwe Sauter wrote:
> Hi folks,
>
> I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
> like to identify the OSD that is causing those events
> once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
> get this info in realtime).
>
> Collecting this information could help identify aging disks if you were able 
> to accumulate and analyze which OSD had blocking
> requests in the past and how often those events occur.
>
> My research so far leads me to think that this information is only available as 
> long as the requests are actually blocked. Is this
> correct?

I think this is what you're looking for:

$> ceph daemon osd.X dump_historic_slow_ops

which gives you recent slow operations, as opposed to

$> ceph daemon osd.X dump_blocked_ops

which returns current blocked operations. You can also add a filter to
those commands.

Mohamad



[ceph-users] in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

2018-05-16 Thread Uwe Sauter
Hi folks,

I'm currently chewing on an issue regarding "slow requests are blocked". I'd 
like to identify the OSD that is causing those events
once the cluster is back to HEALTH_OK (as I have no monitoring yet that would 
get this info in realtime).

Collecting this information could help identify aging disks if you were able to 
accumulate and analyze which OSD had blocking
requests in the past and how often those events occur.

My research so far leads me to think that this information is only available as 
long as the requests are actually blocked. Is this
correct?

MON logs only show that those events occur and how many requests are in a
blocked state, but give no indication of which OSD is affected. Is there a way
to identify blocking requests from the OSD log files?


On a side note: I was trying to write a small Python script that would extract
this kind of information in realtime. While I was able to register a MonitorLog
callback that receives the same messages as you would get with "ceph -w", I
haven't seen in the librados Python bindings documentation a way to do the
equivalent of "ceph health detail". Any suggestions on how to get the blocking
OSDs via librados?
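For reference, this is roughly the direction I was attempting via
mon_command(). It is only a sketch: the JSON for "ceph health detail" and
the REQUEST_SLOW message format ("osd.N has blocked requests > ... sec")
are my assumptions based on Luminous output, not something I have verified:

```python
import json

def blocked_osds(health):
    """Extract the OSD names implicated by the REQUEST_SLOW health check.
    The check name and the 'osd.N has blocked requests > ... sec' message
    format are assumptions based on Luminous health output."""
    osds = set()
    check = health.get("checks", {}).get("REQUEST_SLOW", {})
    for entry in check.get("detail", []):
        for word in entry.get("message", "").split():
            if word.startswith("osd."):
                osds.add(word)
    return sorted(osds)

# Fetching the health report needs a live cluster, e.g.:
#
#   import rados
#   cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
#   cluster.connect()
#   cmd = json.dumps({"prefix": "health", "detail": "detail",
#                     "format": "json"})
#   ret, outbuf, outs = cluster.mon_command(cmd, b"")
#   print(blocked_osds(json.loads(outbuf)))
#   cluster.shutdown()
```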


Thanks,

Uwe