Is there any known connection with the previous discussions "Hit suicide
timeout after adding new osd" or "Ceph unstable on XFS"?

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: January 22, 2013 14:06
To: ceph-devel@vger.kernel.org
Subject: handling fs errors

We observed an interesting situation over the weekend.  The XFS volume under a
ceph-osd locked up (the daemon hung in xfs_ilock) for somewhere between 2 and 4
minutes.  After 3 minutes (180s), ceph-osd gave up waiting and committed
suicide.  XFS seemed to unwedge itself a bit after that, as the daemon was able
to restart and continue.

The problem is that during those 180s the OSD was claiming to be alive but was
not able to do any IO.  That internal heartbeat check is meant as a sanity
check against a wedged kernel, but waiting so long meant that the ceph-osd
wasn't failed by the cluster quickly enough, and client IO stalled.
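
(For reference, the internal heartbeat is roughly a watchdog: worker threads
touch a timestamp as they make progress, and a checker thread warns once a
grace period is exceeded and aborts once the suicide timeout is exceeded.  A
minimal sketch of the pattern, with made-up names rather than the actual
HeartbeatMap code:

#include <atomic>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>

using Clock = std::chrono::steady_clock;

// Worker threads call touch() as they make progress; the watchdog
// thread checks how long it has been since the last touch.
struct ThreadHealth {
  std::atomic<Clock::time_point> last_touch{Clock::now()};
  void touch() { last_touch.store(Clock::now()); }
};

// Warn once the thread has been stuck past `grace`; abort the whole
// process past `suicide` (the 180s fuse above) so it can be restarted.
void watchdog(ThreadHealth& health,
              std::chrono::seconds grace,
              std::chrono::seconds suicide) {
  for (;;) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    auto stalled = Clock::now() - health.last_touch.load();
    if (stalled > suicide) {
      std::cerr << "internal heartbeat dead; committing suicide\n";
      std::abort();
    }
    if (stalled > grace)
      std::cerr << "internal heartbeat warning: thread stalled\n";
  }
}

The std::abort() there is the 180s suicide in question.)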

We could simply change that timeout to something close to the heartbeat
interval (currently the default is 20s).  That would make ceph-osd much more
sensitive to fs stalls that may be transient (high load, whatever).
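
(In ceph.conf terms, that first option would look something like the
following; option names are from memory, so treat them as illustrative:

[osd]
    # how long peers wait before reporting an OSD down (the 20s default)
    osd heartbeat grace = 20
    # internal suicide timeout for the filestore op threads (the 180s
    # figure above); option 1 would shrink this toward the grace period
    filestore op thread suicide timeout = 30
)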

Another option would be to make the OSD heartbeat replies conditional on
whether the internal heartbeat is healthy.  Then the heartbeat warnings could
start at 10-20s, ping replies would pause, but the suicide could still be 180s
out.  If the stall is short-lived, pings will resume, and the OSD will mark
itself back up (if it was marked down) and continue.
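
Concretely, the ping handler would consult the internal health before
replying.  A rough sketch, with hypothetical names rather than the actual OSD
code:

#include <chrono>

using Clock = std::chrono::steady_clock;

struct OSDHealth {
  Clock::time_point last_progress;    // touched by worker threads
  std::chrono::seconds grace{20};     // stop answering pings past this
  std::chrono::seconds suicide{180};  // abort past this

  bool internally_healthy(Clock::time_point now) const {
    return now - last_progress < grace;
  }
};

// Called for each heartbeat ping from a peer OSD.  Going silent makes
// peers report us down quickly; if the fs stall clears before the
// suicide timeout, replies resume and we mark ourselves back up
// instead of dying.
bool handle_ping(const OSDHealth& health) {
  if (!health.internally_healthy(Clock::now()))
    return false;            // drop the reply: let the cluster fail us
  // send_ping_reply(peer);  // elided; hypothetical helper
  return true;
}

The nice property is that going silent is reversible, while suicide is not.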

Having written that out, the last option sounds like the obvious choice.  
Any other thoughts?

sage