Hello all,
I've got a 30 node cluster serving up lots of CephFS data.
We upgraded from Luminous 12.2.11 to Nautilus 14.2.1 on Monday of this
week.
We've been running 2 MDS daemons in an active-active setup. Tonight
one of the MDS daemons crashed several times with the following
assertion failure:
-1> 2019-05-16 00:20:56.775 7f9f22405700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: In function 'void CInode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16 00:20:56.775021
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h: 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val<bool>("mds_hack_allow_loading_invalid_metadata"))
I made a quick decision to drop to a single active MDS because I saw
set_primary_parent in the trace and thought the crash might be related
to automatic balancing between the metadata servers.
That caused one MDS to fail and the other to crash. Now rank 0 loads,
goes active, and then crashes with the following:
-1> 2019-05-16 00:29:21.151 7fe315e8d700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: In function 'void MDCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)
It now looks like we somehow have a duplicate inode in the MDS journal?
https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0,
then became rank 1 after the crash and the attempted drop to one
active MDS
https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0,
which crashed
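For anyone skimming logs like these, here is a minimal Python sketch that tallies the failed assertions by source file and line. It assumes each failure appears unwrapped on a single log line in the form "<path>: <line>: FAILED ceph_assert(<expr>)", as in the excerpts above. The function name count_failed_asserts is made up for illustration and is not part of any Ceph tooling.

```python
import os
import re
from collections import Counter

# Matches lines such as
#   /path/to/src/mds/MDCache.cc: 258: FAILED ceph_assert(!p)
# Assumption: the whole assertion sits on one line (not hard-wrapped).
ASSERT_RE = re.compile(r"(\S+):\s*(\d+): FAILED ceph_assert\((.*)\)\s*$")


def count_failed_asserts(lines):
    """Tally (source file, line number, expression) for each failed assert."""
    hits = Counter()
    for line in lines:
        m = ASSERT_RE.search(line)
        if m:
            path, lineno, expr = m.groups()
            # Keep only the file name; the build path is long and constant.
            hits[(os.path.basename(path), int(lineno), expr)] += 1
    return hits
```

It accepts any iterable of lines, so an open file handle on a downloaded copy of one of the logs above can be passed in directly, and most_common() on the result lists the most frequent assertion first.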
Anyone have any thoughts on this?
Thanks,
Adam