Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Alex Gorbachev
 This can be tuned in the iSCSI initiator on VMware - look in advanced 
 settings on your ESX hosts (at least if you use the software initiator).

Thanks, Jan. I asked VMware this question as well; I think the
problem is specific to a given iSCSI session, so I am wondering whether
that is strictly the target's job.  Do you know of any specific SCSI
settings that mitigate this kind of issue?  Basically, give up on a
session, terminate it, and start a new one should an RBD not
respond?

As I understand, RBD simply never gives up.  If an OSD does not
respond but is still technically up and in, Ceph will retry IOs
forever.  I think RBD and Ceph need a timeout mechanism for this.

Best regards,
Alex

 Jan


 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:

 Hi Alex,

 Currently RBD+LIO+ESX is broken.

 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.

 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.

 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and it
 will all work much better.

 Either tgt or SCST seem to be pretty stable in testing.

 Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs

 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI
 LIO -> ESXi configuration, but the problem is more general.

 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
 slowly or with significant delays.  ceph osd perf does not show this,
 neither
 does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs to a
 point
 where the client times out, crashes or displays other unsavory behavior -
 operationally this crashes production processes.

 Today in our lab we had a disk controller issue, which brought an OSD node
 down.  Upon restart, the OSDs started up and rejoined into the cluster.
 However, immediately all IOs started hanging for a long time and aborts
 from ESXi -> LIO were not succeeding in canceling these IOs.  The only
 warning I could see was:

 root@lab2-mon1:/var/log/ceph# ceph health detail
 HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
 30 ops are blocked > 2097.15 sec
 30 ops are blocked > 2097.15 sec on osd.4
 1 osds have slow requests

 However, ceph osd perf is not showing high latency on osd 4:

 root@lab2-mon1:/var/log/ceph# ceph osd perf
 osd fs_commit_latency(ms) fs_apply_latency(ms)
   0                     0                   13
   1                     0                    0
   2                     0                    0
   3                   172                  208
   4                     0                    0
   5                     0                    0
   6                     0                    1
   7                     0                    0
   8                   174                  819
   9                     6                   10
  10                     0                    1
  11                     0                    1
  12                     3                    5
  13                     0                    1
  14                     7                   23
  15                     0                    1
  16                     0                    0
  17                     5                    9
  18                     0                    1
  19                    10                   18
  20                     0                    0
  21                     0                    0
  22                     0                    1
  23                     5                   10

 SMART state for osd 4 disk is OK.  The OSD is up and in:

 root@lab2-mon1:/var/log/ceph# ceph osd tree
 ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -8        0 root ssd
 -7 14.71997 root platter
 -3  7.12000 host croc3
 22  0.89000 osd.22  up  1.0  1.0
 15  0.89000 osd.15  up  1.0  1.0
 16  0.89000 osd.16  up  1.0  1.0
 13  0.89000 osd.13  up  1.0  1.0
 18  0.89000 osd.18  up  1.0  1.0
 8  0.89000 osd.8   up  1.0  1.0
 11  0.89000 osd.11  up  1.0  1.0
 20  0.89000 osd.20  up  1.0  1.0
 -4  0.47998 host croc2
 10  0.06000 osd

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Jan Schermer
I never actually set up iSCSI with VMware; I just had to research various
VMware storage options when we had a SAN problem at a former job... But I can
take a look at it again if you want me to.

Is it really deadlocked when this issue occurs?
What I think is partly responsible for this situation is that the iSCSI LUN
queues fill up, and that's what actually kills your IO - VMware lowers the
queue depth to 1 in that situation and it can take a really long time to
recover (especially if one of the LUNs on the target constantly has problems,
or when heavy IO hammers the adapter) - you should never fill this queue, ever.
iSCSI is likely an innocent victim in the chain, not the cause of the issues.

Ceph should gracefully handle all those situations; you just need to set the
timeouts right. I have it set so that whatever happens, an OSD can only delay
work for 40s before it is marked down - at that moment all IO starts flowing
again.
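
(The exact options aren't named here; as a rough sketch, the Hammer-era
settings that control how quickly an unresponsive OSD gets marked down and
then out look something like the ceph.conf fragment below - the values are
illustrative only, not a recommendation:)

[global]
    # Peers report an OSD as failed if no heartbeat arrives within this
    # many seconds (failure detection -> "down").
    osd heartbeat grace = 20
    # How many distinct OSDs must report a peer as failed before the
    # monitors mark it down.
    mon osd min down reporters = 2
    # How long a "down" OSD may stay "in" before it is marked "out" and
    # data starts rebalancing onto other OSDs.
    mon osd down out interval = 300
    # Ops in flight longer than this are reported as slow requests in
    # "ceph health detail".
    osd op complaint time = 30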

You should take this to VMware support; they should be able to tell whether
the problem is in the iSCSI target (then you can take a look at how that
behaves) or in the initiator settings. Though in my experience, after two
visits from their foremost experts I had to google everything myself because
they were clueless - YMMV.

The root cause, however, is slow ops in Ceph, and I have no idea why you'd
have them if the OSDs come back up - maybe one of them is really deadlocked
or backlogged in some way? I found that when OSDs are dead but up they don't
respond to "ceph tell osd.xxx ...", so check whether they all respond in a
timely manner; that should help pinpoint the bugger.
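
Something along these lines should do for that check (a rough bash sketch,
untested here - it assumes every id comes back from "ceph osd ls" and that a
healthy OSD answers "ceph tell osd.N version" almost immediately):

for id in $(ceph osd ls); do
    # "ceph tell" has to be answered by the OSD daemon itself, so a
    # wedged-but-up OSD will stall or time out here.
    start=$(date +%s)
    if ! timeout 10 ceph tell osd.$id version > /dev/null 2>&1; then
        echo "osd.$id: no reply within 10s"
    elif [ $(( $(date +%s) - start )) -ge 3 ]; then
        echo "osd.$id: slow reply ($(( $(date +%s) - start ))s)"
    fi
done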

Jan


 On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com wrote:
 
 This can be tuned in the iSCSI initiator on VMware - look in advanced 
 settings on your ESX hosts (at least if you use the software initiator).
 
 Thanks, Jan. I asked this question of Vmware as well, I think the
 problem is specific to a given iSCSI session, so wondering if that's
 strictly the job of the target?  Do you know of any specific SCSI
 settings that mitigate this kind of issue?  Basically, give up on a
 session and terminate it and start a new one should an RBD not
 respond?
 
 As I understand, RBD simply never gives up.  If an OSD does not
 respond but is still technically up and in, Ceph will retry IOs
 forever.  I think RBD and Ceph need a timeout mechanism for this.
 
 Best regards,
 Alex
 
 Jan
 
 
 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:
 
 Hi Alex,
 
 Currently RBD+LIO+ESX is broken.
 
 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.
 
 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.
 
 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and it
 will all work much better.
 
 Either tgt or SCST seem to be pretty stable in testing.
 
 Nick
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs
 
 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI
 LIO -> ESXi configuration, but the problem is more general.
 
 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
 slowly or with significant delays.  ceph osd perf does not show this,
 neither
 does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs to a
 point
 where the client times out, crashes or displays other unsavory behavior -
 operationally this crashes production processes.
 
 Today in our lab we had a disk controller issue, which brought an OSD node
 down.  Upon restart, the OSDs started up and rejoined into the cluster.
 However, immediately all IOs started hanging for a long time and aborts
 from ESXi -> LIO were not succeeding in canceling these IOs.  The only
 warning I could see was:
 
 root@lab2-mon1:/var/log/ceph# ceph health detail
 HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
 30 ops are blocked > 2097.15 sec
 30 ops are blocked > 2097.15 sec on osd.4
 1 osds have slow requests
 
 However, ceph osd perf is not showing high latency on osd 4:
 
 root@lab2-mon1:/var/log/ceph# ceph osd perf
 osd fs_commit_latency(ms) fs_apply_latency(ms)
   0                     0                   13
   1                     0                    0
   2                     0                    0
   3                   172

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Alex Gorbachev
Hi Jan,

On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer j...@schermer.cz wrote:
 I never actually set up iSCSI with VMware, I just had to research various 
 VMware storage options when we had a SAN problem at a former job... But I can 
 take a look at it again if you want me to.

Thank you - I don't want to waste your time, as I have asked VMware TAP
to research that; I will pass along whatever they respond with.


 Is it really deadlocked when this issue occurs?
 What I think is partly responsible for this situation is that the iSCSI LUN 
 queues fill up and that's what actually kills your IO - VMware lowers queue 
 depth to 1 in that situation and it can take a really long time to recover 
 (especially if one of the LUNs  on the target constantly has problems, or 
 when heavy IO hammers the adapter) - you should never fill this queue, ever.
 iSCSI will likely be innocent victim in the chain, not the cause of the 
 issues.

Completely agreed - so iSCSI's job, then, is to properly communicate to
the initiator that it cannot do what it is being asked to do and to quit
the IO.


 Ceph should gracefully handle all those situations, you just need to set the 
 timeouts right. I have it set so that whatever happens the OSD can only delay 
 work for 40s and then it is marked down - at that moment all IO start flowing 
 again.

What setting in Ceph do you use to do that?  Is it
mon_osd_down_out_interval?  I think stopping slow OSDs is the answer
to the root of the problem - so far I only know to run ceph osd perf
and look at the latencies.


 You should take this to VMware support, they should be able to tell whether 
 the problem is in iSCSI target (then you can take a look at how that behaves) 
 or in the initiator settings. Though in my experience after two visits from 
 their foremost experts I had to google everything myself because they were 
 clueless - YMMV.

I am hoping the TAP Elite team can do better...but we'll see...


 The root cause is however slow ops in Ceph, and I have no idea why you'd have 
 them if the OSDs come back up - maybe one of them is really deadlocked or 
 backlogged in some way? I found that when OSDs are dead but up they don't 
 respond to ceph tell osd.xxx ... so try if they all respond in a timely 
 manner, that should help pinpoint the bugger.

I think I know in this case - there are PCIe AER/bus errors and TLP
header messages strewn across the console of one OSD machine, and ceph
osd perf shows latencies above a second per OSD, but only while IO is
being done to those OSDs.  I am thankful this is not production
storage, but I worry about this situation in production - the OSDs are
staying up and in, yet their latencies are slowing cluster-wide IO to a
crawl.  I am trying to envision this situation in production: how
would one find out what is slowing everything down without guessing?
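
One way to look without guessing might be the OSD admin sockets on each
storage host (a sketch only; it assumes the default socket path and the
stock op tracker output):

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $sock =="
    # number of ops currently stuck inside this OSD
    echo "in flight: $(ceph daemon "$sock" dump_ops_in_flight | grep -c '"description"')"
    # slowest recently completed ops and how long each one took; the full
    # JSON also carries a per-event timeline showing where the time went
    ceph daemon "$sock" dump_historic_ops | grep '"duration"' | sort -t: -k2 -rn | head -3
done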

Regards,
Alex



 Jan


 On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com wrote:

 This can be tuned in the iSCSI initiator on VMware - look in advanced 
 settings on your ESX hosts (at least if you use the software initiator).

 Thanks, Jan. I asked this question of Vmware as well, I think the
 problem is specific to a given iSCSI session, so wondering if that's
 strictly the job of the target?  Do you know of any specific SCSI
 settings that mitigate this kind of issue?  Basically, give up on a
 session and terminate it and start a new one should an RBD not
 respond?

 As I understand, RBD simply never gives up.  If an OSD does not
 respond but is still technically up and in, Ceph will retry IOs
 forever.  I think RBD and Ceph need a timeout mechanism for this.

 Best regards,
 Alex

 Jan


 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:

 Hi Alex,

 Currently RBD+LIO+ESX is broken.

 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.

 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.

 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and 
 it
 will all work much better.

 Either tgt or SCST seem to be pretty stable in testing.

 Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs

 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI
 LIO -> ESXi configuration, but the problem is more general.

 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Jan Schermer
This can be tuned in the iSCSI initiator on VMware - look in the advanced
settings on your ESX hosts (at least if you use the software initiator).

Jan


 On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:
 
 Hi Alex,
 
 Currently RBD+LIO+ESX is broken.
 
 The problem is caused by the RBD device not handling device aborts properly
 causing LIO and ESXi to enter a death spiral together.
 
 If something in the Ceph cluster causes an IO to take longer than 10
 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens,
 as you have seen it never recovers.
 
 Mike Christie from Redhat is doing a lot of work on this currently, so
 hopefully in the future there will be a direct RBD interface into LIO and it
 will all work much better.
 
 Either tgt or SCST seem to be pretty stable in testing.
 
 Nick
 
 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs
 
 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI
 LIO -> ESXi configuration, but the problem is more general.
 
 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
 slowly or with significant delays.  ceph osd perf does not show this,
 neither
 does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs to a
 point
 where the client times out, crashes or displays other unsavory behavior -
 operationally this crashes production processes.
 
 Today in our lab we had a disk controller issue, which brought an OSD node
 down.  Upon restart, the OSDs started up and rejoined into the cluster.
 However, immediately all IOs started hanging for a long time and aborts
 from ESXi -> LIO were not succeeding in canceling these IOs.  The only
 warning I could see was:
 
 root@lab2-mon1:/var/log/ceph# ceph health detail
 HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
 30 ops are blocked > 2097.15 sec
 30 ops are blocked > 2097.15 sec on osd.4
 1 osds have slow requests
 
 However, ceph osd perf is not showing high latency on osd 4:
 
 root@lab2-mon1:/var/log/ceph# ceph osd perf
 osd fs_commit_latency(ms) fs_apply_latency(ms)
   0                     0                   13
   1                     0                    0
   2                     0                    0
   3                   172                  208
   4                     0                    0
   5                     0                    0
   6                     0                    1
   7                     0                    0
   8                   174                  819
   9                     6                   10
  10                     0                    1
  11                     0                    1
  12                     3                    5
  13                     0                    1
  14                     7                   23
  15                     0                    1
  16                     0                    0
  17                     5                    9
  18                     0                    1
  19                    10                   18
  20                     0                    0
  21                     0                    0
  22                     0                    1
  23                     5                   10
 
 SMART state for osd 4 disk is OK.  The OSD is up and in:
 
 root@lab2-mon1:/var/log/ceph# ceph osd tree
 ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -8        0 root ssd
 -7 14.71997 root platter
 -3  7.12000 host croc3
 22  0.89000 osd.22  up  1.0  1.0
 15  0.89000 osd.15  up  1.0  1.0
 16  0.89000 osd.16  up  1.0  1.0
 13  0.89000 osd.13  up  1.0  1.0
 18  0.89000 osd.18  up  1.0  1.0
 8  0.89000 osd.8   up  1.0  1.0
 11  0.89000 osd.11  up  1.0  1.0
 20  0.89000 osd.20  up  1.0  1.0
 -4  0.47998 host croc2
 10  0.06000 osd.10  up  1.0  1.0
 12  0.06000 osd.12  up  1.0  1.0
 14  0.06000 osd.14  up  1.0  1.0
 17  0.06000 osd.17  up  1.0  1.0
 19  0.06000 osd.19  up  1.0  1.0
 21  0.06000 osd.21  up  1.0  1.0
 9  0.06000 osd.9   up  1.0  1.0
 23  0.06000 osd.23  up  1.0  1.0
 -2  7.12000 host croc1
 7  0.89000 osd.7   up  1.0

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-24 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 24 August 2015 18:06
 To: Jan Schermer j...@schermer.cz
 Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
 Subject: Re: [ceph-users] Slow responding OSDs are not OUTed and cause
 RBD client IO hangs
 
 HI Jan,
 
 On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer j...@schermer.cz wrote:
  I never actually set up iSCSI with VMware, I just had to research
various
 VMware storage options when we had a SAN problem at a former job... But I
 can take a look at it again if you want me to.
 
 Thank you, I don't want to waste your time as I have asked Vmware TAP to
 research that - I will communicate back anything with which they respond.
 
 
  Is it really deadlocked when this issue occurs?
  What I think is partly responsible for this situation is that the iSCSI
LUN
 queues fill up and that's what actually kills your IO - VMware lowers
queue
 depth to 1 in that situation and it can take a really long time to recover
 (especially if one of the LUNs  on the target constantly has problems, or
 when heavy IO hammers the adapter) - you should never fill this queue,
 ever.
  iSCSI will likely be innocent victim in the chain, not the cause of the
issues.
 
 Completely agreed, so iSCSI's job then is to properly communicate to the
 initiator that it cannot do what it is asked to do and quit the IO.

It's not a queue-full or queue-throttling issue. ESXi detects a slow IO -
I believe when an IO takes longer than 10 seconds - and then tries to send
an abort message to the target so it can retry. However, the RBD client
doesn't handle the abort message passed to it from LIO. I'm not quite sure
what happens next, but neither LIO nor ESXi makes the decision to ignore
the abort, and so both enter a standoff with each other.

 
 
  Ceph should gracefully handle all those situations, you just need to set
the
 timeouts right. I have it set so that whatever happens the OSD can only
delay
 work for 40s and then it is marked down - at that moment all IO start
flowing
 again.
 
 What setting in ceph do you use to do that?  is that
 mon_osd_down_out_interval?  I think stopping slow OSDs is the answer to
 the root of the problem - so far I only know to do ceph osd perf
 and look at latencies.
 

You can maybe adjust some of the timeouts so that Ceph pauses for less time,
hopefully keeping all IO under 10s, but you increase the risk of OSDs
randomly dropping out, and there are probably still quite a few cases where
IO could take longer than 10s.
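
For reference, the values currently in effect can be read back from an
OSD's admin socket before changing anything (a small sketch, run on the
host that carries the OSD; the option names are the Hammer-era ones):

# shows what the daemon is actually using right now
ceph daemon osd.4 config show | egrep 'heartbeat_grace|op_complaint_time|down_out_interval'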

 
  You should take this to VMware support, they should be able to tell
 whether the problem is in iSCSI target (then you can take a look at how
that
 behaves) or in the initiator settings. Though in my experience after two
visits
 from their foremost experts I had to google everything myself because
 they were clueless - YMMV.
 
 I am hoping the TAP Elite team can do better...but we'll see...
 
 
  The root cause is however slow ops in Ceph, and I have no idea why you'd
 have them if the OSDs come back up - maybe one of them is really
 deadlocked or backlogged in some way? I found that when OSDs are dead
 but up they don't respond to ceph tell osd.xxx ... so try if they all
respond
 in a timely manner, that should help pinpoint the bugger.
 
 I think I know in this case - there are some PCIe AER/Bus errors and TLP
 Header messages strewing across the console of one OSD machine - ceph
 osd perf showing latencies above a second per OSD, but only when IO is
 done to those OSDs.  I am thankful this is not production storage, but
worried
 of this situation in production - the OSDs are staying up and in, but
their
 latencies are slowing clusterwide IO to a crawl.  I am trying to envision
this
 situation in production and how would one find out what is slowing
 everything down without guessing.
 
 Regards,
 Alex
 
 
 
  Jan
 
 
  On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com
 wrote:
 
  This can be tuned in the iSCSI initiator on VMware - look in advanced
 settings on your ESX hosts (at least if you use the software initiator).
 
  Thanks, Jan. I asked this question of Vmware as well, I think the
  problem is specific to a given iSCSI session, so wondering if that's
  strictly the job of the target?  Do you know of any specific SCSI
  settings that mitigate this kind of issue?  Basically, give up on a
  session and terminate it and start a new one should an RBD not
  respond?
 
  As I understand, RBD simply never gives up.  If an OSD does not
  respond but is still technically up and in, Ceph will retry IOs
  forever.  I think RBD and Ceph need a timeout mechanism for this.
 
  Best regards,
  Alex
 
  Jan
 
 
  On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote:
 
  Hi Alex,
 
  Currently RBD+LIO+ESX is broken.
 
  The problem is caused by the RBD device not handling device aborts
  properly causing LIO

Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-23 Thread Nick Fisk
Hi Alex,

Currently RBD+LIO+ESX is broken.

The problem is caused by the RBD device not handling device aborts properly
causing LIO and ESXi to enter a death spiral together.

If something in the Ceph cluster causes an IO to take longer than 10
seconds (I think!!!), ESXi submits an iSCSI abort message. Once this happens,
as you have seen, it never recovers.

Mike Christie from Red Hat is doing a lot of work on this currently, so
hopefully in the future there will be a direct RBD interface into LIO and it
will all work much better.

Either tgt or SCST seem to be pretty stable in testing.

Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Alex Gorbachev
 Sent: 23 August 2015 02:17
 To: ceph-users ceph-users@lists.ceph.com
 Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD
 client IO hangs
 
 Hello, this is an issue we have been suffering from and researching along
 with a good number of other Ceph users, as evidenced by the recent posts.
 In our specific case, these issues manifest themselves in an RBD -> iSCSI
 LIO -> ESXi configuration, but the problem is more general.
 
 When there is an issue on OSD nodes (examples: network hangs/blips, disk
 HBAs failing, driver issues, page cache/XFS issues), some OSDs respond
 slowly or with significant delays.  ceph osd perf does not show this,
neither
 does ceph osd tree, ceph -s / ceph -w.  Instead, the RBD IO hangs to a
point
 where the client times out, crashes or displays other unsavory behavior -
 operationally this crashes production processes.
 
 Today in our lab we had a disk controller issue, which brought an OSD node
 down.  Upon restart, the OSDs started up and rejoined into the cluster.
 However, immediately all IOs started hanging for a long time and aborts
 from ESXi -> LIO were not succeeding in canceling these IOs.  The only
 warning I could see was:
 
 root@lab2-mon1:/var/log/ceph# ceph health detail
 HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
 30 ops are blocked > 2097.15 sec
 30 ops are blocked > 2097.15 sec on osd.4
 1 osds have slow requests
 
 However, ceph osd perf is not showing high latency on osd 4:
 
 root@lab2-mon1:/var/log/ceph# ceph osd perf
 osd fs_commit_latency(ms) fs_apply_latency(ms)
   0                     0                   13
   1                     0                    0
   2                     0                    0
   3                   172                  208
   4                     0                    0
   5                     0                    0
   6                     0                    1
   7                     0                    0
   8                   174                  819
   9                     6                   10
  10                     0                    1
  11                     0                    1
  12                     3                    5
  13                     0                    1
  14                     7                   23
  15                     0                    1
  16                     0                    0
  17                     5                    9
  18                     0                    1
  19                    10                   18
  20                     0                    0
  21                     0                    0
  22                     0                    1
  23                     5                   10
 
 SMART state for osd 4 disk is OK.  The OSD is up and in:
 
 root@lab2-mon1:/var/log/ceph# ceph osd tree
 ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -8        0 root ssd
 -7 14.71997 root platter
 -3  7.12000 host croc3
 22  0.89000 osd.22  up  1.0  1.0
 15  0.89000 osd.15  up  1.0  1.0
 16  0.89000 osd.16  up  1.0  1.0
 13  0.89000 osd.13  up  1.0  1.0
 18  0.89000 osd.18  up  1.0  1.0
  8  0.89000 osd.8   up  1.0  1.0
 11  0.89000 osd.11  up  1.0  1.0
 20  0.89000 osd.20  up  1.0  1.0
 -4  0.47998 host croc2
 10  0.06000 osd.10  up  1.0  1.0
 12  0.06000 osd.12  up  1.0  1.0
 14  0.06000 osd.14  up  1.0  1.0
 17  0.06000 osd.17  up  1.0  1.0
 19  0.06000 osd.19  up  1.0  1.0
 21  0.06000 osd.21  up  1.0  1.0
  9  0.06000 osd.9   up  1.0  1.0
 23  0.06000 osd.23  up  1.0  1.0
 -2  7.12000 host croc1
  7  0.89000 osd.7   up  1.0  1.0
  2  0.89000 osd.2   up  1.0  1.0
  6  0.89000 osd.6   up  1.0  1.0
  1  0.89000 osd.1   up  1.0  1.0
  5

[ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs

2015-08-22 Thread Alex Gorbachev
Hello, this is an issue we have been suffering from and researching
along with a good number of other Ceph users, as evidenced by the
recent posts.  In our specific case, these issues manifest themselves
in an RBD -> iSCSI LIO -> ESXi configuration, but the problem is more
general.

When there is an issue on OSD nodes (examples: network hangs/blips,
disk HBAs failing, driver issues, page cache/XFS issues), some OSDs
respond slowly or with significant delays.  ceph osd perf does not
show this, neither does ceph osd tree, ceph -s / ceph -w.  Instead,
the RBD IO hangs to a point where the client times out, crashes or
displays other unsavory behavior - operationally this crashes
production processes.

Today in our lab we had a disk controller issue, which brought an OSD
node down.  Upon restart, the OSDs started up and rejoined into the
cluster.  However, immediately all IOs started hanging for a long time
and aborts from ESXi -> LIO were not succeeding in canceling these
IOs.  The only warning I could see was:

root@lab2-mon1:/var/log/ceph# ceph health detail
HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
30 ops are blocked > 2097.15 sec
30 ops are blocked > 2097.15 sec on osd.4
1 osds have slow requests

However, ceph osd perf is not showing high latency on osd 4:

root@lab2-mon1:/var/log/ceph# ceph osd perf
osd fs_commit_latency(ms) fs_apply_latency(ms)
  0                     0                   13
  1                     0                    0
  2                     0                    0
  3                   172                  208
  4                     0                    0
  5                     0                    0
  6                     0                    1
  7                     0                    0
  8                   174                  819
  9                     6                   10
 10                     0                    1
 11                     0                    1
 12                     3                    5
 13                     0                    1
 14                     7                   23
 15                     0                    1
 16                     0                    0
 17                     5                    9
 18                     0                    1
 19                    10                   18
 20                     0                    0
 21                     0                    0
 22                     0                    1
 23                     5                   10

SMART state for osd 4 disk is OK.  The OSD is up and in:

root@lab2-mon1:/var/log/ceph# ceph osd tree
ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-8        0 root ssd
-7 14.71997 root platter
-3  7.12000 host croc3
22  0.89000 osd.22  up  1.0  1.0
15  0.89000 osd.15  up  1.0  1.0
16  0.89000 osd.16  up  1.0  1.0
13  0.89000 osd.13  up  1.0  1.0
18  0.89000 osd.18  up  1.0  1.0
 8  0.89000 osd.8   up  1.0  1.0
11  0.89000 osd.11  up  1.0  1.0
20  0.89000 osd.20  up  1.0  1.0
-4  0.47998 host croc2
10  0.06000 osd.10  up  1.0  1.0
12  0.06000 osd.12  up  1.0  1.0
14  0.06000 osd.14  up  1.0  1.0
17  0.06000 osd.17  up  1.0  1.0
19  0.06000 osd.19  up  1.0  1.0
21  0.06000 osd.21  up  1.0  1.0
 9  0.06000 osd.9   up  1.0  1.0
23  0.06000 osd.23  up  1.0  1.0
-2  7.12000 host croc1
 7  0.89000 osd.7   up  1.0  1.0
 2  0.89000 osd.2   up  1.0  1.0
 6  0.89000 osd.6   up  1.0  1.0
 1  0.89000 osd.1   up  1.0  1.0
 5  0.89000 osd.5   up  1.0  1.0
 0  0.89000 osd.0   up  1.0  1.0
 4  0.89000 osd.4   up  1.0  1.0
 3  0.89000 osd.3   up  1.0  1.0

How can we proactively detect this condition?  Is there anything I can
run that will output all slow OSDs?
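
As a crude sketch of what such a check might look like (it keys off the
plain "ceph health detail" and "ceph osd perf" output, and the 100 ms
threshold is arbitrary):

# any OSDs the monitors already blame for blocked requests
ceph health detail | grep 'osd\.'
# any OSD whose commit/apply latency is above 100 ms right now
ceph osd perf | awk 'NR > 1 && ($2 > 100 || $3 > 100) {print "osd." $1 ": commit " $2 " ms, apply " $3 " ms"}'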

Regards,
Alex
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com