Hi all,

We seem to have hit a bug in the CephFS kernel client, and I want to confirm 
what action to take. We get the error "wrong peer at address" in dmesg, and 
some jobs on that server seem to get stuck on file system access; a log extract 
is below. I found two tracker items, https://tracker.ceph.com/issues/23883 and 
https://tracker.ceph.com/issues/41519, neither of which seems to have a fix.

My questions:

- Is this harmless, or does it indicate invalid/corrupted client cache entries?
- How should we resolve it: ignore it, umount+mount, or reboot?

Here is an extract from the dmesg log; the error has already survived a couple 
of MDS restarts:

[Mon Mar  6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar  6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar  6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar  6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar  6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar  6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar  6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar  6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar  6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar  6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar  6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar  6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar  6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar  6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar  6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar  6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar  6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar  6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar  6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar  6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar  6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar  6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar  6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar  6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar  6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar  6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar  9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar  9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar  9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar  9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar  9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar  9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
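
In the extract, the "want" address stays the same throughout while the "got" address changes once after the MDS reconnects. To make that pattern easier to see over a longer log, the "wrong peer" lines can be tallied per want/got pair. The following is only a minimal sketch: it is not part of any Ceph tooling, the function name summarize_wrong_peer is made up for illustration, and it assumes the message format is exactly as it appears in the extract above.

```python
import re
import sys
from collections import Counter

# Matches the libceph "wrong peer" message as it appears in the extract.
# \s+ also matches line breaks, so mail-wrapped log lines are tolerated.
WRONG_PEER = re.compile(r"libceph: wrong peer, want\s+(\S+), got\s+(\S+)")


def summarize_wrong_peer(log_text):
    """Count how often each (want, got) address pair occurs in dmesg text."""
    return Counter(WRONG_PEER.findall(log_text))


if __name__ == "__main__":
    # Read dmesg output from stdin and print one line per distinct pair.
    counts = summarize_wrong_peer(sys.stdin.read())
    for (want, got), n in counts.most_common():
        print(f"{n:4d}  want {want}  got {got}")
```

Saved under a name of your choice (wrong_peer.py here is hypothetical), it could be fed with "dmesg -T | python3 wrong_peer.py". It only counts messages; it does not say anything about whether the client cache is affected.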

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14