Hey guys,

Today, while adding new monitors to the cluster, I noticed that two OSD
servers couldn't talk to each other. I am not sure whether adding the
monitors caused the issue or whether it was always there and adding the
monitor merely exposed it. After removing the new monitor the cluster went
back to healthy, but the following errors are still being logged.

On both servers all the OSD logs show various messages like:

2016-06-20 12:51:32.148682 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.89 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148699 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.90 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148708 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.91 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148717 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.92 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148724 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.93 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148763 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.95 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148770 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.96 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)

On Server A, all of these errors mention Server B's OSDs, and on Server B
they mention Server A's OSDs. None of the other 10 servers have any of
these issues.
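
To confirm that pattern across many log files, the heartbeat messages can be tallied per reporting OSD and silent peer. The following is a minimal Python sketch, assuming the log format matches the lines quoted above; the function name `tally_missing_replies` is made up for illustration and is not part of Ceph.

```python
import re
from collections import Counter

# Matches one heartbeat_check entry even when the mail client wrapped it
# across several lines (\s+ also matches newlines).
HEARTBEAT_RE = re.compile(
    r"(osd\.\d+)\s+\d+\s+heartbeat_check:\s+no\s+reply\s+from\s+(osd\.\d+)"
)


def tally_missing_replies(log_text):
    """Count (reporting OSD, silent peer) pairs in a chunk of OSD log text."""
    return Counter(HEARTBEAT_RE.findall(log_text))


if __name__ == "__main__":
    # Two entries copied from the log excerpt above.
    sample = """\
2016-06-20 12:51:32.148682 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.89 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
2016-06-20 12:51:32.148699 7f6d24024700 -1 osd.102 17667 heartbeat_check:
no reply from osd.90 ever on either front or back, first ping sent
2016-06-20 11:12:47.527049 (cutoff 2016-06-20 12:51:12.148679)
"""
    for (reporter, peer), count in sorted(tally_missing_replies(sample).items()):
        print(f"{reporter} -> {peer}: {count}")
```

Mapping the OSD ids back to hosts (for example with `ceph osd tree`) then shows whether every silent peer really lives on the other server.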

I confirmed using telnet that the OSD ports are reachable.
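
A plain telnet test leaves from whatever source address the kernel picks, which may not be the path the OSD heartbeats take. A minimal Python sketch of a reachability check that can also pin the source address follows; the helper name `can_connect` is hypothetical, and the addresses in the usage comment are placeholders, since the real ones are not given in this post.

```python
import socket


def can_connect(dst_host, dst_port, src_ip=None, timeout=3.0):
    """Return True if a TCP connection to dst_host:dst_port succeeds.

    If src_ip is given, the socket is bound to that local address first,
    so the test leaves from the same address the OSD would use.
    """
    source = (src_ip, 0) if src_ip else None
    try:
        with socket.create_connection(
            (dst_host, dst_port), timeout=timeout, source_address=source
        ):
            return True
    except OSError:
        # Covers refused connections, timeouts, and a src_ip that is not
        # configured on this machine.
        return False


# Example (placeholder addresses, substitute your own):
#   can_connect("198.51.100.12", 6806, src_ip="192.0.2.11")
```

Running it once per combination (public to public, cluster to cluster, and cluster to public) for each OSD port would show which paths between the two servers actually work.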

I'm using separate cluster and public networks. One thing I did notice is
this error: "0 -- private-ip-server-a:0/15329 >>
public-ip-server-b:6806/6465 pipe(0x7f9910761000 sd=64 :0 s=1 pgs=0 cs=0
l=1 c=0x7f9910f7e100).fault"

This seems to imply that Server A is trying to connect from its cluster IP
to Server B's public (client) IP. Could this be the root cause? And if so,
how can I prevent that from happening?
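
To check mechanically whether the two endpoints of such a fault line sit on different networks, the addresses can be pulled out and matched against the configured subnets. This is a rough Python sketch under stated assumptions: the subnets in the example are invented documentation ranges, the names `classify` and `fault_endpoints` are hypothetical, and only IPv4-style `addr:port/nonce` endpoints are handled.

```python
import ipaddress
import re

# Matches the "src:port/nonce >> dst:port/nonce" part of a messenger line,
# including the case where the line was wrapped after ">>".
ENDPOINTS_RE = re.compile(r"(\S+):(\d+)/\d+\s+>>\s+(\S+):(\d+)/\d+")


def classify(addr, networks):
    """Return the name of the network containing addr, or 'unknown'."""
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        # Not a literal IP, e.g. a redacted name like private-ip-server-a.
        return "unknown"
    for name, cidr in networks.items():
        if ip in ipaddress.ip_network(cidr):
            return name
    return "unknown"


def fault_endpoints(line, networks):
    """Return ((src, src_net), (dst, dst_net)) for a fault line, or None."""
    match = ENDPOINTS_RE.search(line)
    if match is None:
        return None
    src, _src_port, dst, _dst_port = match.groups()
    return (src, classify(src, networks)), (dst, classify(dst, networks))


if __name__ == "__main__":
    # Assumed subnets for illustration only; use the values from ceph.conf.
    nets = {"public": "198.51.100.0/24", "cluster": "192.0.2.0/24"}
    line = "0 -- 192.0.2.11:0/15329 >> 198.51.100.12:6806/6465 pipe(...).fault"
    print(fault_endpoints(line, nets))
```

A source classified as "cluster" paired with a destination classified as "public" would match the reading above, though the sketch only reports the mismatch and does not explain why it happens.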

Thanks,

Peter