Re: [ceph-users] Strange Ceph host behaviour

2018-10-02 Thread Steve Taylor
Unless this is related to load and the OSDs really are unresponsive, it is
almost certainly some sort of network issue. A duplicate IP address,
maybe?


 
Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799


[ceph-users] Strange Ceph host behaviour

2018-10-02 Thread Vincent Godin
Our Ceph cluster runs Jewel 10.2.11.
Mons and hosts are on CentOS 7.5.1804, kernel 3.10.0-862.6.3.el7.x86_64.

Every day, we see a lot of entries like these in ceph.log on the monitor:

2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 :
cluster [WRN] map e612590 wrongly marked me down
2018-10-02 16:07:06.462653 osd.464 192.168.1.232:6830/6650 317 :
cluster [WRN] map e612588 wrongly marked me down
2018-10-02 16:07:10.717673 osd.470 192.168.1.232:6836/7554 371 :
cluster [WRN] map e612591 wrongly marked me down
2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 :
cluster [WRN] map e612599 wrongly marked me down
2018-10-02 16:14:48.422442 osd.403 192.168.1.227:6832/6727 509 :
cluster [WRN] map e612597 wrongly marked me down
2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 :
cluster [WRN] map e612608 wrongly marked me down
2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 :
cluster [WRN] map e612604 wrongly marked me down
2018-10-02 16:15:11.680405 osd.429 192.168.1.228:6838/7393 536 :
cluster [WRN] map e612607 wrongly marked me down
2018-10-02 16:15:14.246717 osd.431 192.168.1.228:6822/5937 474 :
cluster [WRN] map e612609 wrongly marked me down
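
To see whether these warnings cluster on particular hosts, the entries can be tallied per source IP. Below is a minimal Python sketch, assuming the entry layout shown above. The names ENTRY and count_wrongly_marked_down are illustrative and not part of Ceph, and the sample is a subset of the lines quoted above.

```python
import re
from collections import Counter

# One ceph.log entry. \s+ between tokens tolerates the line wrapping
# introduced by the mail client. The pattern is an assumption based on
# the excerpt above, not an official log grammar.
ENTRY = re.compile(
    r"(osd\.\d+)\s+(\d+\.\d+\.\d+\.\d+):\d+/\d+\s+\d+\s+:\s+"
    r"cluster\s+\[WRN\]\s+map\s+e\d+\s+wrongly\s+marked\s+me\s+down"
)


def count_wrongly_marked_down(log_text):
    """Return a Counter mapping host IP to the number of matching entries."""
    return Counter(ip for _osd, ip in ENTRY.findall(log_text))


sample = """\
2018-10-02 16:07:08.882374 osd.478 192.168.1.232:6838/7689 386 :
cluster [WRN] map e612590 wrongly marked me down
2018-10-02 16:14:51.179945 osd.414 192.168.1.227:6808/4767 670 :
cluster [WRN] map e612599 wrongly marked me down
2018-10-02 16:15:13.198180 osd.436 192.168.1.228:6828/6402 533 :
cluster [WRN] map e612608 wrongly marked me down
2018-10-02 16:15:08.792369 osd.433 192.168.1.228:6832/6732 515 :
cluster [WRN] map e612604 wrongly marked me down
"""

# Print hosts ordered by how often their OSDs were marked down.
for ip, n in count_wrongly_marked_down(sample).most_common():
    print(ip, n)
```

Run against the full ceph.log, a count dominated by a few IPs would point at those hosts, while an even spread would point elsewhere.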

On the server 192.168.1.228, for example, /var/log/messages looks like this:

Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
Oct  2 16:15:03 bd-ceph-22 ceph-osd: 2018-10-02 16:15:03.935841
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:43.935824)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.283822
7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:44.283811)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.353645
7f1110a32700 -1 osd.438 612603 heartbeat_check: no reply from
192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:44.353612)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.373905
7f71375de700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.373897)
Oct  2 16:15:04 bd-ceph-22 ceph-osd: 2018-10-02 16:15:04.935997
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:15:04.369740 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:44.935981)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.007484
7f10d97ec700 -1 osd.438 612603 heartbeat_check: no reply from
192.168.1.212:6807 osd.186 since back 2018-10-02 16:14:59.700105 front
2018-10-02 16:14:43.884248 (cutoff 2018-10-02 16:14:45.007477)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.017154
7fd4cee4d700 -1 osd.435 612603 heartbeat_check: no reply from
192.168.1.212:6833 osd.195 since back 2018-10-02 16:15:03.273909 front
2018-10-02 16:14:44.648411 (cutoff 2018-10-02 16:14:45.017106)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.158580
7fe343c96700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:00.450196 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.158567)
Oct  2 16:15:05 bd-ceph-22 ceph-osd: 2018-10-02 16:15:05.283983
7fe378c13700 -1 osd.426 612603 heartbeat_check: no reply from
192.168.1.215:6807 osd.240 since back 2018-10-02 16:15:05.154458 front
2018-10-02 16:14:43.433054 (cutoff 2018-10-02 16:14:45.283975)
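
In every entry above, the cutoff is about 20 seconds before the entry's own timestamp, and it is the "front" timestamp that is older than the cutoff, while the "back" timestamp is recent. The following minimal Python sketch extracts that from the log, assuming the layout shown above. HEARTBEAT, parse_ts and stale_interfaces are illustrative names, not part of Ceph.

```python
import re
from datetime import datetime

# Timestamp as printed by the OSD; \s+ tolerates wrapped lines.
TS = r"(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}\.\d+)"

# Pattern is an assumption derived from the excerpt above.
HEARTBEAT = re.compile(
    r"(osd\.\d+)\s+\d+\s+heartbeat_check:\s+no\s+reply\s+from\s+"
    r"(\S+)\s+(osd\.\d+)\s+since\s+back\s+" + TS
    + r"\s+front\s+" + TS + r"\s+\(cutoff\s+" + TS + r"\)"
)


def parse_ts(text):
    """Parse an OSD timestamp after collapsing any wrapped whitespace."""
    return datetime.strptime(" ".join(text.split()), "%Y-%m-%d %H:%M:%S.%f")


def stale_interfaces(log_text):
    """Return (reporting OSD, peer OSD, stale sides) for each entry.

    A side ("back" or "front") is stale when its last reply is older
    than the cutoff printed in the same entry.
    """
    results = []
    for reporter, _addr, peer, back, front, cutoff in HEARTBEAT.findall(log_text):
        cut = parse_ts(cutoff)
        stale = [name for name, ts in (("back", back), ("front", front))
                 if parse_ts(ts) < cut]
        results.append((reporter, peer, stale))
    return results


sample = """\
Oct  2 16:15:02 bd-ceph-22 ceph-osd: 2018-10-02 16:15:02.935658
7f716f16e700 -1 osd.432 612603 heartbeat_check: no reply from
192.168.1.215:6815 osd.242 since back 2018-10-02 16:14:59.065582 front
2018-10-02 16:14:42.046092 (cutoff 2018-10-02 16:14:42.935642)
"""

for reporter, peer, stale in stale_interfaces(sample):
    print(reporter, "->", peer, "stale:", ", ".join(stale) or "none")
```

Grouping the result by peer OSD or by stale side over a longer log window would show whether the missed replies are confined to one heartbeat connection or to particular peers.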

There was no network problem at that time (I checked the logs on the
host and on the switch). The OSD logs show nothing but "wrongly marked me
down" and session resets caused by this monitor action. As several OSDs
are affected, it looks like a host problem.

The sysctl.conf is:

net.core.rmem_max=56623104
net.core.wmem_max=56623104
net.core.rmem_default=56623104
net.core.wmem_default=56623104
net.core.optmem_max=40960
net.ipv4.tcp_rmem=4096 87380 56623104
net.ipv4.tcp_wmem=4096 65536 56623104
net.core.somaxconn=1024
net.core.netdev_max_backlog=5
net.ipv4.tcp_max_syn_backlog=3
net.ipv4.tcp_max_tw_buckets=200
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_slow_start_after_idle=0
net.ipv4.udp_rmem_min=8192
net.ipv4.udp_wmem_min=8192
net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.all.accept_source_route=0

kernel.pid_max=4194303
fs.file-max=26234859
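
When reviewing settings like these, it can help to confirm that the running kernel actually uses the values in the file. Below is a minimal Python sketch that parses sysctl.conf-style text and compares it with /proc/sys. The function names are illustrative, and the mapping of dots in a key to path separators is an assumption that holds for the keys listed above.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'key=value' lines, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        # Normalise whitespace so multi-value settings compare cleanly.
        settings[key.strip()] = " ".join(value.split())
    return settings


def live_value(key, root="/proc/sys"):
    """Read the running value for a key, or None if it cannot be read."""
    path = Path(root, *key.split("."))
    try:
        return " ".join(path.read_text().split())
    except OSError:
        return None


def mismatches(settings, root="/proc/sys"):
    """Return {key: (configured, running)} for every value that differs."""
    result = {}
    for key, wanted in settings.items():
        running = live_value(key, root)
        if running != wanted:
            result[key] = (wanted, running)
    return result


# A subset of the settings listed above.
sample = """\
net.core.somaxconn=1024
net.ipv4.tcp_rmem=4096 87380 56623104
net.ipv4.tcp_fin_timeout=10
"""

for key, (wanted, running) in mismatches(parse_sysctl(sample)).items():
    print(f"{key}: configured {wanted}, running {running}")
```

The root parameter exists only so the comparison can be pointed at a copy of /proc/sys taken from another host.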

Does anyone have an idea, or has anyone already encountered this behaviour?