RE: handling fs errors

2013-01-22 Thread Chen, Xiaoxi
Is there any known connection with the previous discussions "Hit suicide
timeout after adding new osd" or "Ceph unstable on XFS"?

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: January 22, 2013 14:06
To: ceph-devel@vger.kernel.org
Subject: handling fs errors

We observed an interesting situation over the weekend.  The XFS volume ceph-osd 
locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes.  After 3 
minutes (180s), ceph-osd gave up waiting and committed suicide.  XFS seemed to 
unwedge itself a bit after that, as the daemon was able to restart and continue.

The problem is that during that 180s the OSD was claiming to be alive but not 
able to do any IO.  That heartbeat check is meant as a sanity check against a 
wedged kernel, but waiting so long meant that the ceph-osd wasn't failed by the 
cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat 
interval (currently default is 20s).  That will make ceph-osd much more 
sensitive to fs stalls that may be transient (high load, whatever).

Another option would be to make the osd heartbeat replies conditional on 
whether the internal heartbeat is healthy.  Then the heartbeat warnings could 
start at 10-20s, ping replies would pause, but the suicide could still be 180s 
out.  If the stall is short-lived, pings will continue, the osd will mark 
itself back up (if it was marked down) and continue.
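
A minimal sketch of what that gating could look like (the names and
structure here are hypothetical, not the actual ceph-osd heartbeat code):

#include <chrono>
#include <functional>
#include <map>
#include <string>

using Clock = std::chrono::steady_clock;

// Hypothetical internal heartbeat map: each worker thread records when it
// last made progress.  Names are illustrative only.
struct InternalHeartbeat {
  std::map<std::string, Clock::time_point> last_progress;
  std::chrono::seconds warn_after{20};     // start warning / stop replying
  std::chrono::seconds suicide_after{180}; // give up and exit

  // Healthy only if every registered thread has made progress recently.
  bool is_healthy(Clock::time_point now = Clock::now()) const {
    for (const auto& [name, t] : last_progress)
      if (now - t > warn_after)
        return false;
    return true;
  }

  bool past_suicide(Clock::time_point now = Clock::now()) const {
    for (const auto& [name, t] : last_progress)
      if (now - t > suicide_after)
        return true;
    return false;
  }
};

// Sketch of the ping handler: only reply while the internal heartbeat is
// healthy, so peers stop hearing from us long before the 180s suicide.
void handle_ping(const InternalHeartbeat& hb,
                 const std::function<void()>& send_reply) {
  if (!hb.is_healthy())
    return;       // stay silent; peers will report us down to the mon
  send_reply();   // otherwise answer as usual
}

The point is that peers stop hearing from the osd within one heartbeat
interval, while the 180s suicide stays in place as the last resort.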

Having written that out, the last option sounds like the obvious choice.  
Any other thoughts?

sage

Re: handling fs errors

2013-01-22 Thread Wido den Hollander



On 01/22/2013 07:12 AM, Yehuda Sadeh wrote:

On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com wrote:

We observed an interesting situation over the weekend.  The XFS volume
ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was
able to restart and continue.

The problem is that during that 180s the OSD was claiming to be alive but
not able to do any IO.  That heartbeat check is meant as a sanity check
against a wedged kernel, but waiting so long meant that the ceph-osd
wasn't failed by the cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat
interval (currently default is 20s).  That will make ceph-osd much more
sensitive to fs stalls that may be transient (high load, whatever).

Another option would be to make the osd heartbeat replies conditional on
whether the internal heartbeat is healthy.  Then the heartbeat warnings
could start at 10-20s, ping replies would pause, but the suicide could
still be 180s out.  If the stall is short-lived, pings will continue, the
osd will mark itself back up (if it was marked down) and continue.

Having written that out, the last option sounds like the obvious choice.
Any other thoughts?



Another option would be to have the osd reply to the ping with some
health description.



Looking to the future with more monitoring that might be a good idea.

If an OSD simply stops sending heartbeats when its internal conditions 
aren't met, you don't know what's going on.


If the heartbeat had metadata which says "I'm here, but not in such good 
shape", that could be reported back to the monitors.


Monitoring tools could read this out and send notifications/alerts 
wherever they want.


Right now we assume I/O completely stalls, but the metadata could also 
report high latency. If the latency goes over threshold X you could still 
mark the OSD out temporarily, since it will impact clients, but some 
information towards the monitor might be useful.
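
For illustration, a ping reply carrying such metadata might look roughly
like this (hypothetical fields and names; nothing like this exists in
Ceph's messages today):

#include <cstdint>

// Hypothetical health payload piggybacked on a heartbeat reply.  None of
// these fields exist in Ceph; they only illustrate the idea being discussed.
struct OsdPingReplyHealth {
  bool          io_healthy;          // internal heartbeat threads all progressing
  std::uint32_t worst_op_latency_ms; // slowest recent op, for latency alerting
  std::uint32_t stalled_threads;     // internal threads past their deadline
};

enum class OsdVerdict { healthy, degraded, mark_out };

// What a monitor or monitoring tool might do with such a report.
OsdVerdict judge(const OsdPingReplyHealth& h,
                 std::uint32_t latency_threshold_ms) {
  if (!h.io_healthy || h.stalled_threads > 0)
    return OsdVerdict::mark_out;             // effectively not serving I/O
  if (h.worst_op_latency_ms > latency_threshold_ms)
    return OsdVerdict::degraded;             // up, but worth an alert
  return OsdVerdict::healthy;
}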


Wido


Yehuda


Re: handling fs errors

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 5:12 AM, Wido den Hollander wrote:
 On 01/22/2013 07:12 AM, Yehuda Sadeh wrote:
  On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com 
  (mailto:s...@inktank.com) wrote:
   We observed an interesting situation over the weekend. The XFS volume
   ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
   minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed
   suicide. XFS seemed to unwedge itself a bit after that, as the daemon was
   able to restart and continue.
   
   The problem is that during that 180s the OSD was claiming to be alive but
   not able to do any IO. That heartbeat check is meant as a sanity check
   against a wedged kernel, but waiting so long meant that the ceph-osd
   wasn't failed by the cluster quickly enough and client IO stalled.
   
   We could simply change that timeout to something close to the heartbeat
   interval (currently default is 20s). That will make ceph-osd much more
   sensitive to fs stalls that may be transient (high load, whatever).
   
   Another option would be to make the osd heartbeat replies conditional on
   whether the internal heartbeat is healthy. Then the heartbeat warnings
   could start at 10-20s, ping replies would pause, but the suicide could
   still be 180s out. If the stall is short-lived, pings will continue, the
   osd will mark itself back up (if it was marked down) and continue.
   
   Having written that out, the last option sounds like the obvious choice.
   Any other thoughts?
  
  
  
  Another option would be to have the osd reply to the ping with some
  health description.
 
 
 
 Looking to the future with more monitoring that might be a good idea.
 
 If an OSD simply stops sending heartbeats when its internal conditions 
 aren't met, you don't know what's going on.
 
 If the heartbeat had metadata which says "I'm here, but not in such good 
 shape", that could be reported back to the monitors.


I think we want to move towards more comprehensive pinging like this, but it's 
not something to do in haste. Pausing pings when the internal threads are 
disappearing sounds like a good simple step to make the reporting better match 
reality.
-Greg 



Re: handling fs errors

2013-01-22 Thread Dimitri Maziuk
On 01/22/2013 12:05 AM, Sage Weil wrote:
 We observed an interesting situation over the weekend.  The XFS volume 
 ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 
 minutes.
...

FWIW I see this often enough on cheap sata drives: they've a failure
mode that makes the sata driver time out, reset the link, resend the
command, rinse, lather, repeat. (You usually get "slow to respond, please
be patient" and/or "resetting link" in syslog & console.) It's at a low
enough level to freeze the whole system for minutes.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: handling fs errors

2013-01-22 Thread Sage Weil
On Wed, 23 Jan 2013, Andrey Korolyov wrote:
 On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil s...@inktank.com wrote:
  We observed an interesting situation over the weekend.  The XFS volume
  ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
  minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
  suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was
  able to restart and continue.
 
  The problem is that during that 180s the OSD was claiming to be alive but
  not able to do any IO.  That heartbeat check is meant as a sanity check
  against a wedged kernel, but waiting so long meant that the ceph-osd
  wasn't failed by the cluster quickly enough and client IO stalled.
 
  We could simply change that timeout to something close to the heartbeat
  interval (currently default is 20s).  That will make ceph-osd much more
  sensitive to fs stalls that may be transient (high load, whatever).
 
  Another option would be to make the osd heartbeat replies conditional on
  whether the internal heartbeat is healthy.  Then the heartbeat warnings
  could start at 10-20s, ping replies would pause, but the suicide could
  still be 180s out.  If the stall is short-lived, pings will continue, the
  osd will mark itself back up (if it was marked down) and continue.
 
  Having written that out, the last option sounds like the obvious choice.
  Any other thoughts?
 
  sage
 
 By the way, is there any value in preventing the situation below - one
 host buried in soft lockups marking almost all of its neighbors down for
 a short time? That's harmless, because all the osds will rejoin the
 cluster within the next couple of heartbeats, but all I/O will be stuck
 during that time, rather than only operations on pgs on the failing osd,
 so maybe it'd be useful to introduce a kind of down-mark quorum for such
 cases.
 
 2013-01-22 14:40:31.481174 mon.0 [INF] osd.0 10.5.0.10:6800/6578 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481085 >= grace 20.00)

I believe there is already a tunable to adjust this... 'min reporters' or 
something, check 'ceph --show-config | grep ^mon'.
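
The idea behind that kind of tunable, roughly sketched (illustrative code
and parameter names, not the monitor's actual failure-report handling):

#include <set>

// Sketch: only act on a failure once enough *distinct* peers have reported
// it, so one soft-locked host can't get all of its neighbors marked down.
struct FailurePending {
  std::set<int> reporters;   // osd ids that reported this target as failed
  int           reports = 0; // total failure reports received
};

bool should_mark_down(const FailurePending& f,
                      int min_reports,      // e.g. 3
                      int min_reporters) {  // e.g. 2 distinct peers
  return f.reports >= min_reports &&
         static_cast<int>(f.reporters.size()) >= min_reporters;
}

In the log here each failure was based on "3 reports from 1 peers", so
requiring reports from more than one distinct reporter would have
discounted the single soft-locked host.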

sage


 2013-01-22 14:40:31.481293 mon.0 [INF] osd.1 10.5.0.11:6800/6488 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481228 >= grace 20.00)
 2013-01-22 14:40:31.481410 mon.0 [INF] osd.2 10.5.0.12:6803/7561 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481355 >= grace 20.00)
 2013-01-22 14:40:31.481522 mon.0 [INF] osd.4 10.5.0.14:6803/5697 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481467 >= grace 20.00)
 2013-01-22 14:40:31.481641 mon.0 [INF] osd.6 10.5.0.16:6803/5679 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481586 >= grace 20.00)
 2013-01-22 14:40:31.481746 mon.0 [INF] osd.8 10.5.0.10:6803/6638 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481700 >= grace 20.00)
 2013-01-22 14:40:31.481863 mon.0 [INF] osd.9 10.5.0.11:6803/6547 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481811 >= grace 20.00)
 2013-01-22 14:40:31.481976 mon.0 [INF] osd.10 10.5.0.12:6800/7019 failed (3 reports from 1 peers after 2013-01-22 14:40:54.481916 >= grace 20.00)
 2013-01-22 14:40:31.482077 mon.0 [INF] osd.12 10.5.0.14:6800/5637 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482022 >= grace 20.00)
 2013-01-22 14:40:31.482184 mon.0 [INF] osd.14 10.5.0.16:6800/5620 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482130 >= grace 20.00)
 2013-01-22 14:40:31.482334 mon.0 [INF] osd.17 10.5.0.31:6800/5854 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482275 >= grace 20.00)
 2013-01-22 14:40:31.482436 mon.0 [INF] osd.18 10.5.0.32:6800/5981 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482389 >= grace 20.00)
 2013-01-22 14:40:31.482539 mon.0 [INF] osd.19 10.5.0.33:6800/5570 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482489 >= grace 20.00)
 2013-01-22 14:40:31.482667 mon.0 [INF] osd.20 10.5.0.34:6800/5643 failed (3 reports from 1 peers after 2013-01-22 14:40:54.482620 >= grace 20.00)


Re: handling fs errors

2013-01-22 Thread Sage Weil
On Tue, 22 Jan 2013, Dimitri Maziuk wrote:
 On 01/22/2013 12:05 AM, Sage Weil wrote:
  We observed an interesting situation over the weekend.  The XFS volume 
  ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 
  minutes.
 ...
 
 FWIW I see this often enough on cheap sata drives: they've a failure
 mode that makes the sata driver time out, reset the link, resend the
 command, rinse, lather, repeat. (You usually get "slow to respond, please
 be patient" and/or "resetting link" in syslog & console.) It's at a low
 enough level to freeze the whole system for minutes.

Excellent point, I'd forgotten about that.

This task is queued up for this sprint, by the way, and marked for 
backport to bobtail.

http://tracker.newdream.net/issues/3888

Thanks-
sage



handling fs errors

2013-01-21 Thread Sage Weil
We observed an interesting situation over the weekend.  The XFS volume 
ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 
minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed 
suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was 
able to restart and continue.

The problem is that during that 180s the OSD was claiming to be alive but 
not able to do any IO.  That heartbeat check is meant as a sanity check 
against a wedged kernel, but waiting so long meant that the ceph-osd 
wasn't failed by the cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat 
interval (currently default is 20s).  That will make ceph-osd much more 
sensitive to fs stalls that may be transient (high load, whatever).

Another option would be to make the osd heartbeat replies conditional on 
whether the internal heartbeat is healthy.  Then the heartbeat warnings 
could start at 10-20s, ping replies would pause, but the suicide could 
still be 180s out.  If the stall is short-lived, pings will continue, the 
osd will mark itself back up (if it was marked down) and continue.

Having written that out, the last option sounds like the obvious choice.  
Any other thoughts?

sage


Re: handling fs errors

2013-01-21 Thread Yehuda Sadeh
On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com wrote:
 We observed an interesting situation over the weekend.  The XFS volume
 ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
 minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
 suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was
 able to restart and continue.

 The problem is that during that 180s the OSD was claiming to be alive but
 not able to do any IO.  That heartbeat check is meant as a sanity check
 against a wedged kernel, but waiting so long meant that the ceph-osd
 wasn't failed by the cluster quickly enough and client IO stalled.

 We could simply change that timeout to something close to the heartbeat
 interval (currently default is 20s).  That will make ceph-osd much more
 sensitive to fs stalls that may be transient (high load, whatever).

 Another option would be to make the osd heartbeat replies conditional on
 whether the internal heartbeat is healthy.  Then the heartbeat warnings
 could start at 10-20s, ping replies would pause, but the suicide could
 still be 180s out.  If the stall is short-lived, pings will continue, the
 osd will mark itself back up (if it was marked down) and continue.

 Having written that out, the last option sounds like the obvious choice.
 Any other thoughts?


Another option would be to have the osd reply to the ping with some
health description.

Yehuda


Re: handling fs errors

2013-01-21 Thread Andrey Korolyov
On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil s...@inktank.com wrote:
 We observed an interesting situation over the weekend.  The XFS volume
 ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4
 minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
 suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was
 able to restart and continue.

 The problem is that during that 180s the OSD was claiming to be alive but
 not able to do any IO.  That heartbeat check is meant as a sanity check
 against a wedged kernel, but waiting so long meant that the ceph-osd
 wasn't failed by the cluster quickly enough and client IO stalled.

 We could simply change that timeout to something close to the heartbeat
 interval (currently default is 20s).  That will make ceph-osd much more
 sensitive to fs stalls that may be transient (high load, whatever).

 Another option would be to make the osd heartbeat replies conditional on
 whether the internal heartbeat is healthy.  Then the heartbeat warnings
 could start at 10-20s, ping replies would pause, but the suicide could
 still be 180s out.  If the stall is short-lived, pings will continue, the
 osd will mark itself back up (if it was marked down) and continue.

 Having written that out, the last option sounds like the obvious choice.
 Any other thoughts?

 sage

It seems possible to run into domino-style failure markings there if the
lock is triggered frequently enough, and that depends only on the sheer
amount of workload. By the way, was that fs aged, or were you able to
catch the lock on a fresh one? And which kernel were you running there?

Thanks!
