On 11/11/19 2:00 PM, Shawn Iverson wrote:
> Hello Cephers!
> 
> I had a node over the weekend go nuts from what appears to have been
> failed/bad memory modules and/or motherboard.
> 
> This resulted in several OSDs blocking IO for > 128s (indefinitely).
> 
> I was not watching my alerts closely over the weekend, or else I might
> have caught it earlier. The servers across the cluster that rely on ceph
> stalled on the blocked IO from the failing node and had to be restarted
> after I took the faulty node offline.
> 
> So, my question is: is there a way to tell ceph to start setting OSDs
> out when an IO blockage exceeds a certain limit, or are there risks in
> doing so such that I would be better off dealing with a stalled ceph
> cluster?
> 

In the end the OSDs should commit suicide via their internal suicide
timeout, but this is always a problem with failing hardware. The best
approach is to have the Linux machine kill itself rather than relying on
the OSD to handle it.

So, for example, just kill the node when memory problems occur. Setting
vm.panic_on_oom (not what happened in this case) is something I've done
before, combined with kernel.panic=60 so the machine reboots itself a
minute after panicking.
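For reference, the two kernel parameters mentioned above can be made
persistent roughly like this. vm.panic_on_oom and kernel.panic are real
Linux sysctls; the drop-in file name is arbitrary and just an example:

```shell
# Panic immediately on an out-of-memory condition instead of letting
# the OOM killer pick off processes one by one.
# kernel.panic = 60 makes the machine reboot 60 seconds after a panic,
# so the node removes itself from the cluster rather than limping along.
cat <<'EOF' | sudo tee /etc/sysctl.d/90-panic-on-oom.conf
vm.panic_on_oom = 1
kernel.panic = 60
EOF

# Apply all sysctl configuration files now.
sudo sysctl --system
```

With this in place, a node that runs out of memory panics and reboots
instead of blocking IO indefinitely, and ceph can then mark its OSDs
down and out through the normal failure-detection path.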

Wido

> -- 
> Shawn Iverson, CETL
> Director of Technology
> Rush County Schools
> ivers...@rushville.k12.in.us <mailto:ivers...@rushville.k12.in.us>
> 
> Cybersecurity
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 