Hi all,

I didn't find much information about this kernel client loop in the ML
archives. Here are my observations, which I'll try to investigate further.

My setup:
* 2 datacenters connected using an IPsec tunnel configured for routing
(2 subnets)
* connection to the WAN using PPPoE (pppd plus the kernel PPP modules)
* the PPP connection lasts exactly 7 days, after which the provider
kills it and my PPP client re-establishes it (WAN/inter-datacenter
communication is thus down for ~30 s)
* 3 MON + n×OSD + MGR + MDS in each datacenter
* 2 client servers using the cephfs kernel module; in each datacenter,
one of them runs the pppd client and the IPsec endpoint (Pacemaker
manages this front-end aspect of the cluster)
* a single cephfs mount, which is not managed by Pacemaker (a sketch of
the mount entry is shown right after this list)
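
For reference, it is a plain kernel-client mount; roughly this kind of
fstab entry (the MON list, mount point, user name and secretfile path
are illustrative placeholders, not my exact values):

172.21.0.16,172.21.0.18,172.21.0.20,172.22.0.16,172.22.0.18,172.22.0.20:/ /mnt/cephfs ceph name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev 0 0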

Observations:
* when the ppp0 connection drops, pppd restores the default route from
"via the PPP tunnel" to "via a virtual IP which happens to be on the
same host" (but could move to the other peer):

Mar 28 19:07:09 neon pppd[5543]: restoring old default route to eth0 
[172.21.0.254]
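
In routing-table terms, the flip is roughly the following (the eth0 line
matches the log above; the ppp0 line is only an approximation of what
"ip route" shows while the link is up):

  default dev ppp0  scope link        # while the PPP link is up
  default via 172.21.0.254 dev eth0   # after pppd restores the old route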

* IPsec et al. react cleanly (the tunnel is removed, then recreated when
PPP comes back up):

Mar 28 19:07:43 neon pppd[5543]: Connected to 02:....:42 via interface eth1.835
Mar 28 19:07:43 neon pppd[5543]: CHAP authentication succeeded
Mar 28 19:07:43 neon pppd[5543]: peer from calling number 02:....:42 authorized
Mar 28 19:07:43 neon pppd[5543]: replacing old default route to eth0 
[172.21.0.254]

* ~20 s after the PPP link is back up and IPsec is restored, libceph
starts complaining (neon is the client/gateway on 172.21.0.0/16 which
lost its PPP link; sodium is the remote side of the same IPsec tunnel):

Mar 28 19:08:03 neon kernel: [1232455.656828] libceph: mon1 172.21.0.18:6789 
socket closed (con state OPEN)
Mar 28 19:08:12 neon kernel: [1232463.846633] ceph: mds0 caps stale
Mar 28 19:08:16 neon kernel: [1232468.128577] ceph: mds0 caps went stale, 
renewing
Mar 28 19:08:16 neon kernel: [1232468.128581] ceph: mds0 caps stale
Mar 28 19:08:30 neon kernel: [1232482.601183] libceph: mon3 172.22.0.16:6789 
session established
Mar 28 19:09:01 neon kernel: [1232513.256059] libceph: mon3 172.22.0.16:6789 
session lost, hunting for new mon
Mar 28 19:09:01 neon kernel: [1232513.321176] libceph: mon5 172.22.0.20:6789 
session established
Mar 28 19:09:32 neon kernel: [1232543.977003] libceph: mon5 172.22.0.20:6789 
session lost, hunting for new mon
Mar 28 19:09:32 neon kernel: [1232543.979567] libceph: mon2 172.21.0.20:6789 
session established
Mar 28 19:09:39 neon kernel: [1232551.435001] ceph: mds0 caps renewed
Mar 28 19:10:02 neon kernel: [1232574.697885] libceph: mon2 172.21.0.20:6789 
session lost, hunting for new mon
Mar 28 19:10:02 neon kernel: [1232574.763614] libceph: mon4 172.22.0.18:6789 
session established
Mar 28 19:10:33 neon kernel: [1232605.418776] libceph: mon4 172.22.0.18:6789 
session lost, hunting for new mon
Mar 28 19:10:33 neon kernel: [1232605.420896] libceph: mon0 172.21.0.16:6789 
session established
Mar 28 19:11:04 neon kernel: [1232636.139720] libceph: mon0 172.21.0.16:6789 
session lost, hunting for new mon
Mar 28 19:11:04 neon kernel: [1232636.205717] libceph: mon3 172.22.0.16:6789 
session established

Mar 28 19:07:40 sodium kernel: [1211268.708716] libceph: mon0 172.21.0.16:6789 
session lost, hunting for new mon
Mar 28 19:07:44 sodium kernel: [1211272.208735] libceph: mon5 172.22.0.20:6789 
socket closed (con state OPEN)
Mar 28 19:07:53 sodium kernel: [1211281.683700] libceph: mon2 172.21.0.20:6789 
socket closed (con state OPEN)
Mar 28 19:08:18 sodium kernel: [1211306.856489] libceph: mon5 172.22.0.20:6789 
session established
Mar 28 19:08:49 sodium kernel: [1211337.575101] libceph: mon5 172.22.0.20:6789 
session lost, hunting for new mon
Mar 28 19:08:49 sodium kernel: [1211337.640884] libceph: mon0 172.21.0.16:6789 
session established
Mar 28 19:09:20 sodium kernel: [1211368.296187] libceph: mon0 172.21.0.16:6789 
session lost, hunting for new mon
Mar 28 19:09:20 sodium kernel: [1211368.299194] libceph: mon4 172.22.0.18:6789 
session established
Mar 28 19:09:50 sodium kernel: [1211399.017229] libceph: mon4 172.22.0.18:6789 
session lost, hunting for new mon
Mar 28 19:09:50 sodium kernel: [1211399.019655] libceph: mon5 172.22.0.20:6789 
session established

* the active MDS happens to be on sodium's side (172.22.*), whereas the
primary MON happens to be on neon's side (172.21.*), which explains the
dissimilar messages
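
Side note: while this loop is going on, the kernel client's view of its
MON session can be inspected through debugfs (assuming debugfs is
mounted; the directory name under /sys/kernel/debug/ceph/ depends on the
fsid and client id):

  cat /sys/kernel/debug/ceph/*/monc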

* the remote peer for sodium also had a cephfs mount and also
complained (lithium just saw its gateway, a virtual IP on sodium, stop
responding while sodium's IPsec tunnel to neon was down):

Mar 28 19:07:35 lithium kernel: [603537.458858] libceph: mon2 172.21.0.20:6789 
session lost, hunting for new mon
Mar 28 19:07:44 lithium kernel: [603545.702369] libceph: mon5 172.22.0.20:6789 
socket closed (con state OPEN)
Mar 28 19:07:58 lithium kernel: [603560.177718] libceph: mon2 172.21.0.20:6789 
socket closed (con state OPEN)
Mar 28 19:08:16 lithium kernel: [603577.883001] ceph: mds0 caps went stale, 
renewing
Mar 28 19:08:16 lithium kernel: [603577.883004] ceph: mds0 caps stale
Mar 28 19:08:16 lithium kernel: [603577.883483] ceph: mds0 caps renewed
Mar 28 19:08:19 lithium kernel: [603581.559718] libceph: mon0 172.21.0.16:6789 
session established
Mar 28 19:08:52 lithium kernel: [603614.261565] libceph: mon0 172.21.0.16:6789 
session lost, hunting for new mon
Mar 28 19:08:52 lithium kernel: [603614.263845] libceph: mon3 172.22.0.16:6789 
session established
Mar 28 19:09:23 lithium kernel: [603644.982627] libceph: mon3 172.22.0.16:6789 
session lost, hunting for new mon
Mar 28 19:09:23 lithium kernel: [603644.984546] libceph: mon5 172.22.0.20:6789 
session established
Mar 28 19:09:54 lithium kernel: [603675.703696] libceph: mon5 172.22.0.20:6789 
session lost, hunting for new mon
Mar 28 19:09:54 lithium kernel: [603675.773835] libceph: mon2 172.21.0.20:6789 
session established

* all those "hunting for new mon" messages stop when I unmount/remount
the cephfs filesystem (kernel client, not FUSE); the remount itself is
shown after the log excerpt below:

Mar 28 20:38:27 neon kernel: [1237879.172749] libceph: mon2 172.21.0.20:6789 
session lost, hunting for new mon
Mar 28 20:38:27 neon kernel: [1237879.238902] libceph: mon4 172.22.0.18:6789 
session established
Mar 28 20:38:57 neon kernel: [1237909.893569] libceph: mon4 172.22.0.18:6789 
session lost, hunting for new mon
Mar 28 20:38:57 neon kernel: [1237909.895888] libceph: mon1 172.21.0.18:6789 
session established
Mar 28 20:39:31 neon kernel: [1237942.989104] libceph: mon0 172.21.0.16:6789 
session established
Mar 28 20:39:31 neon kernel: [1237942.990244] libceph: client114108 fsid 
819889bd-de05-4bf5-ab43-da16d93f9308
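
For completeness, the unmount/remount boils down to something like this
(mount point as in the fstab sketch above):

  umount /mnt/cephfs
  mount /mnt/cephfs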

I suspect that all this is related to the kernel routing table, which is
altered by pppd, restored to its original value, then re-updated when
the PPP link comes back up. I have had similar problems with daemons
such as dnsmasq, ntpd, etc.; the only solution seems to be to restart
those daemons.
I may have to unmount/remount cephfs to get the same effect (a possible
pppd hook for that is sketched below). I'll also try cephfs via FUSE.
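
If the remount does turn out to be the reliable fix, a crude workaround
might be a pppd ip-up hook along these lines (untested sketch: it
assumes a Debian-style /etc/ppp/ip-up.d/ directory, the placeholder
mount point from the fstab sketch above, and a guessed delay for IPsec
to come back; it will also fail if anything still holds files open on
the mount):

  #!/bin/sh
  # /etc/ppp/ip-up.d/cephfs-remount -- untested sketch
  # give IPsec a moment to re-establish the tunnel
  sleep 30
  # bounce the kernel cephfs mount so libceph rebuilds its MON/MDS sessions
  umount /mnt/cephfs && mount /mnt/cephfs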

Has anyone dug into the cause of this flurry of messages?

TIA,

-- 
Nicolas Huillard