Unless this is related to load and the OSDs really are unresponsive, it is
almost certainly some sort of network issue. A duplicate IP address,
maybe?
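
One way to narrow this down is to check which timestamp in each heartbeat_check entry has actually fallen behind the cutoff. The following is only a rough sketch, not an official Ceph tool. It assumes that "back" and "front" in those messages are the last replies seen on the two heartbeat connections (cluster side and public side), and that the entries have the shape shown in the quoted /var/log/messages excerpt. The names HEARTBEAT_RE and stale_interfaces are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Pattern modelled on the heartbeat_check entries quoted in this thread;
# other Ceph releases may word the message differently (assumption).
HEARTBEAT_RE = re.compile(
    r"(?P<osd>osd\.\d+) \d+ heartbeat_check: no reply from "
    r"(?P<peer_ip>\d+(?:\.\d+){3}):\d+ (?P<peer_osd>osd\.\d+) "
    r"since back (?P<back>\S+ \S+) front (?P<front>\S+ \S+) "
    r"\(cutoff (?P<cutoff>[^)]+)\)"
)


def stale_interfaces(log_text):
    """Count, per (peer IP, interface), entries whose last reply predates the cutoff."""
    # Drop mail quote markers and undo line wrapping so that entries
    # pasted from an email parse the same way as raw log lines.
    unquoted = re.sub(r"(?m)^>\s?", "", log_text)
    flat = " ".join(unquoted.split())
    counts = Counter()
    for m in HEARTBEAT_RE.finditer(flat):
        cutoff = datetime.strptime(m["cutoff"], TS_FORMAT)
        for iface in ("back", "front"):
            if datetime.strptime(m[iface], TS_FORMAT) < cutoff:
                counts[(m["peer_ip"], iface)] += 1
    return counts


# Two entries copied from the quoted log excerpt, still line-wrapped.
sample = """\
Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154
7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from
192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front
2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
"""

for (peer_ip, iface), n in sorted(stale_interfaces(sample).items()):
    print(peer_ip, iface, n)
```

If the stale side is consistently the same interface, or the same few peer addresses keep showing up, that would point at one network path or one host rather than at general load.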


 
Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | 
 

 

On Tue, 2018-10-02 at 17:17 +0200, Vincent Godin wrote:
> Ceph cluster in Jewel 10.2.11
> Mons & Hosts are on CentOS 7.5.1804 kernel 3.10.0-862.6.3.el7.x86_64
> 
> Every day, we see a lot of entries like these in ceph.log on the monitor:
> 
> 2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 :
> cluster [WRN] map e612590 wrongly marked me down
> 2018-10-02 16:07:06.462653 osd.464 192.168.1.232:6830/6650 317 :
> cluster [WRN] map e612588 wrongly marked me down
> 2018-10-02 16:07:10.717673 osd.470 192.168.1.232:6836/7554 371 :
> cluster [WRN] map e612591 wrongly marked me down
> 2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 :
> cluster [WRN] map e612599 wrongly marked me down
> 2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 :
> cluster [WRN] map e612597 wrongly marked me down
> 2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 :
> cluster [WRN] map e612608 wrongly marked me down
> 2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 :
> cluster [WRN] map e612604 wrongly marked me down
> 2018-10-02 16:15:11.680405 osd.429 192.168.1.228:6838/7393 536 :
> cluster [WRN] map e612607 wrongly marked me down
> 2018-10-02 16:15:14.246717 osd.431 192.168.1.228:6822/5937 474 :
> cluster [WRN] map e612609 wrongly marked me down
> 
> On server 192.168.1.228, for example, /var/log/messages looks like this:
> 
> Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658
> 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
> Oct  2 16:15:03 bd-ceph-22 ceph-osd: 2018-10-02 16:15:03.935841
> 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:43.935824)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.283822
> 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
> 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
> 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:44.283811)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.353645
> 7f1110a32700 -1 osd.438 612603 heartbeat_check: no reply from
> 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
> 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:44.353612)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.373905
> 7f71375de700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.373897)
> Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.935997
> 7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
> 192.168.1.215:6815 osd.242 since back 2018-10-02 16:15:04.369740 front
> 2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.935981)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.007484
> 7f10d97ec700 -1 osd.438 612603 heartbeat_check: no reply from
> 192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
> 2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:45.007477)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154
> 7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from
> 192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front
> 2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.158580
> 7fe343c96700 -1 osd.426 612603 heartbeat_check: no reply from
> 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
> 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.158567)
> Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.283983
> 7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
> 192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:05.154458 front
> 2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.283975)
> 
> There was no network problem at that time (I checked the logs on the
> host and on the switch). The OSD logs show nothing but "wrongly marked me
> down" and session resets caused by this monitor action. As several OSDs
> are impacted, it looks like a host problem.
> 
> The sysctl.conf is:
> 
> net.core.rmem_max=56623104
> net.core.wmem_max=56623104
> net.core.rmem_default=56623104
> net.core.wmem_default=56623104
> net.core.optmem_max=40960
> net.ipv4.tcp_rmem=4096 87380 56623104
> net.ipv4.tcp_wmem=4096 65536 56623104
> net.core.somaxconn=1024
> net.core.netdev_max_backlog=50000
> net.ipv4.tcp_max_syn_backlog=30000
> net.ipv4.tcp_max_tw_buckets=2000000
> net.ipv4.tcp_tw_reuse=1
> net.ipv4.tcp_fin_timeout=10
> net.ipv4.tcp_slow_start_after_idle=0
> net.ipv4.udp_rmem_min=8192
> net.ipv4.udp_wmem_min=8192
> net.ipv4.conf.all.send_redirects=0
> net.ipv4.conf.all.accept_redirects=0
> net.ipv4.conf.all.accept_source_route=0
> 
> kernel.pid_max=4194303
> fs.file-max=26234859
> 
> Does anyone have an idea, or has anyone already seen this behaviour?
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 