There are multiple settings that affect this; osd_heartbeat_grace is
probably the most relevant. If an OSD gets no heartbeat response from a
peer OSD for longer than the heartbeat grace period, it reports to the
mons that the peer is down. Once mon_osd_min_down_reporters OSDs have
reported a given OSD down, the cluster marks that OSD down. If the OSD
does not then contact the mons directly to say that it is up, it will be
marked out after mon_osd_down_out_interval has elapsed. If it does contact
the mons to say that it is up, then it should be responding again and be
fine.
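
For reference, here is a minimal sketch of those knobs as they would
appear in ceph.conf. The values shown are the usual defaults in recent
releases, not a recommendation; check what your version actually uses
with something like "ceph daemon osd.0 config show":

    [osd]
    # seconds without a heartbeat reply before a peer reports this OSD down
    osd heartbeat grace = 20

    [mon]
    # distinct reporters required before the mons mark an OSD down
    mon osd min down reporters = 2

    # seconds a down OSD waits before being marked out (600 = 10 minutes)
    mon osd down out interval = 600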
In your case, where the OSD is half up and half down... I believe all you
can really do is monitor your cluster and troubleshoot the OSDs causing
problems like this. Essentially every storage solution is vulnerable to
this failure mode. Sometimes an OSD just needs to be restarted because it
got into a bad state somehow, or removed from the cluster entirely
because its disk is failing.
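
As a rough starting point for that troubleshooting (osd.12 here is just a
placeholder id, and "ceph daemon" has to be run on the host where that
OSD lives):

    # find which OSDs have slow/blocked requests
    ceph health detail

    # see what the suspect OSD is actually stuck on
    ceph daemon osd.12 dump_ops_in_flight

    # force the mons to mark it down so the PGs re-peer, then restart it
    ceph osd down 12
    systemctl restart ceph-osd@12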
On Sun, Mar 4, 2018 at 2:28 AM shadow_lin <shadow_...@163.com> wrote:
> Hi list,
> During my test of Ceph, I found that sometimes the whole cluster gets
> blocked, and the cause was one malfunctioning OSD. Ceph can heal itself
> if an OSD is down, but it seems that if an OSD is half dead (it still
> heartbeats but can't handle requests), then all the requests directed to
> that OSD are blocked. If all OSDs are in one pool, the whole cluster can
> be blocked by that one hung OSD.
> I think this is because Ceph distributes requests across all OSDs, and
> if one OSD won't confirm that a request is done, then everything is
> blocked.
> Is there a way to let Ceph mark the crippled OSD down if the requests
> directed to it have been blocked for more than a certain time, so that
> the whole cluster is not blocked?