Re: [ceph-users] Problematic inode preventing ceph-mds from starting
Hi! Yes, resetting journals is exactly what we did, quite a while ago, when the
mds ran out of memory because a journal entry had an absurdly large number in
it (I think it may have been an inode number). We probably also reset the inode
table later, which I recently learned resets a data structure on disk, and
probably started us overwriting inodes or dentries or both.

So I take it (we are learning about filesystems very quickly over here) that
ceph is reusing inode numbers. Re-scanning dentries will somehow figure out
which dentry is most recent, and remove the older (now wrong) one. And somehow
it can handle hard links, possibly (we don't have many, or any, of these).

Thanks very much for your help. This has been fascinating.

Neale

From: Patrick Donnelly
Sent: Monday, October 28, 2019 12:52
To: Pickett, Neale T
Cc: ceph-users
Subject: Re: [ceph-users] Problematic inode preventing ceph-mds from starting

On Fri, Oct 25, 2019 at 12:11 PM Pickett, Neale T wrote:
> In the last week we have made a few changes to the down filesystem in an
> attempt to fix what we thought was an inode problem:
>
>     cephfs-data-scan scan_extents   # about 1 day with 64 processes
>     cephfs-data-scan scan_inodes    # about 1 day with 64 processes
>     cephfs-data-scan scan_links     # about 1 day

Did you reset the journals or perform any other disaster recovery commands?
This process likely introduced the duplicate inodes.

> After these three, we tried to start an MDS and it stayed up. We then ran:
>
>     ceph tell mds.a scrub start / recursive repair
>
> The repair ran about 3 days, spewing logs to `ceph -w` about duplicated
> inodes, until it stopped. All looked well until we began bringing production
> services back online, at which point many error messages appeared, the mds
> went back into damaged, and the fs back to degraded. At this point I removed
> the objects you suggested, which brought everything back briefly.
>
> The latest crash is:
>
>     -1> 2019-10-25 18:47:50.731 7fc1f3b56700 -1
>     /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>     In function 'void MDCache::add_inode(CInode*)' thread 7fc1f3b56700 time 2019-1...
>
>     /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>     258: FAILED ceph_assert(!p)

This error indicates a duplicate inode loaded into cache. Fixing this probably
requires significant intervention and (meta)data loss for recent changes:

- Stop/unmount all clients. (Probably already the case if the rank is damaged!)
- Reset the MDS journal [1] and optionally recover any dentries first. (This
  will hopefully resolve the ESubtreeMap errors you pasted.) Note that some
  metadata may be lost through this command.
- Run `cephfs-data-scan scan_links` again. This should repair any duplicate
  inodes (by dropping the older dentries).
- Then you can try marking the rank as repaired.

Good luck!

[1] https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/#journal-truncation

--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
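Concretely, the sequence Patrick outlines maps roughly onto the following
commands. This is a sketch rather than a verbatim recipe from the thread: the
filesystem name "cephfs" and the single rank 0 are assumptions, and every MDS
daemon must be stopped before running the journal tools.

    cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary   # optional: salvage dentries before the reset
    cephfs-journal-tool --rank=cephfs:0 journal reset                    # discard the unreplayable journal (loses recent metadata)
    cephfs-data-scan scan_links                                          # drop the older dentry of each duplicate inode
    ceph mds repaired cephfs:0                                           # clear the damaged flag on rank 0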
Re: [ceph-users] Problematic inode preventing ceph-mds from starting
gga [ERR] replayed ESubtreeMap at 30403641312085 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.728678 mds.frigga [ERR] replayed ESubtreeMap at 30403641478175 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.734157 mds.frigga [ERR] failure replaying journal (EMetaBlob)
2019-10-25 18:48:06.676985 mon.coffee [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2019-10-25 18:48:06.677063 mon.coffee [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)

We now show mds: cephfs:0/1 5 up:standby, 1 damaged (we only have 5 mds servers).

This still smells like an inode problem to me, but I have completely run out of
ideas, so I will do nothing more to ceph as I anxiously hope I am not fired for
this 14-days-and-counting outage while awaiting a reply from the list.

Thank you very much!

Neale

From: Patrick Donnelly
Sent: Thursday, October 24, 2019 17:57
To: Pickett, Neale T
Cc: ceph-users
Subject: Re: [ceph-users] Problematic inode preventing ceph-mds from starting

Hi Neale,

On Fri, Oct 18, 2019 at 9:31 AM Pickett, Neale T wrote:
>
> Last week I asked about a rogue inode that was causing ceph-mds to segfault
> during replay. We didn't get any suggestions from this list, so we have been
> familiarizing ourselves with the ceph source code, and have added the
> following patch:
>
> --- a/src/mds/CInode.cc
> +++ b/src/mds/CInode.cc
> @@ -736,6 +736,13 @@ CDir *CInode::get_approx_dirfrag(frag_t fg)
>
>  CDir *CInode::get_or_open_dirfrag(MDCache *mdcache, frag_t fg)
>  {
> +  if (!is_dir()) {
> +    ostringstream oss;
> +    JSONFormatter f(true);
> +    f.open_object_section("inode");  // otherwise the output is hard to read
> +    dump(&f, DUMP_PATH | DUMP_INODE_STORE_BASE | DUMP_MDS_CACHE_OBJECT |
> +         DUMP_LOCKS | DUMP_STATE | DUMP_CAPS | DUMP_DIRFRAGS);
> +    f.close_section();
> +    f.flush(oss);
> +    dout(0) << oss.str() << dendl;
> +  }
>    ceph_assert(is_dir());
>
>    // have it?

This seems like a generally useful patch. Feel free to submit a PR.

> This has given us a culprit:
>
>     -2> 2019-10-18 16:19:06.934 7faefa470700 0 mds.0.cache.ino(0x1995e63)
>     "/unimportant/path/we/can/tolerate/losing/compat.py" 10995216789470
>     "2018-03-24 03:18:17.621969" "2018-03-24 03:18:17.620969" 3318855521001
>     { "dir_hash": 0 }
>     { "stripe_unit": 4194304, "stripe_count": 1, "object_size": 4194304,
>       "pool_id": 1, "pool_ns": "" }
>     []
>     3411844674407370955161500 "2015-01-27 16:01:52.467669"
>     "2018-03-24 03:18:17.621969" 21-1 []
>     { "version": 0, "mtime": "0.00", "num_files": 0, "num_subdirs": 0 }
>     { "version": 0, "rbytes": 34, "rfiles": 1, "rsubdirs": 0, "rsnaps": 0, "rctime": "0.00" }
>     { "version": 0, "rbytes": 34, "rfiles": 1, "rsubdirs": 0, "rsnaps": 0, "rctime": "0.00" }
>     2540123 "" "" []
>     { "splits": [] }
>     true { "replicas": {} }
>     { "authority": [ 0, -2 ], "replica_nonce": 0 }
>     0 false false {} 0
>     { "gather_set": [], "state": "lock", "is_leased": false, "num_rdlocks": 0,
>       "num_wrlocks": 0, "num_xlocks": 0, "xlock_by": {} }
>     {} {} {} {} {} {} {} {} {}
>     [ "auth" ] []
>     -1 -1 [] []
>
>     -1> 2019-10-18 16:19:06.964 7faefa470700 -1
>     /opt/app-root/src/ceph/src/mds/CInode.cc: In function 'CDir*
>     CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7faefa470700 time
>     2019-10-18 16:19:06.934662
>     /opt/app-root/src/ceph/src/mds/CInode.cc: 746: FAILED ceph_assert(is_dir())
>
>     ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
>     1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x7faf0a9ce39e]
>     2: (()+0x12a8620) [0x7faf0a9ce620]
>     3: (CInode::get_or_open_dirfrag(MDCache*, frag_t)+0x253) [0x557562a4b1ad]
>     4: (OpenFileTable::_prefetch_dirfrags()+0x4db) [0x557562b63d63]
>     5: (OpenFileTabl
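Before another round of disaster recovery, the journal and the rank's damage
state can be inspected without modifying anything. A sketch; the rank spelling
is an assumption, "frigga" is one of the MDS daemon names from the log above,
and `damage ls` only answers while an MDS for the rank is running:

    cephfs-journal-tool --rank=cephfs:0 journal inspect   # report journal integrity, read-only
    ceph tell mds.frigga damage ls                        # list damage entries recorded by the rank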
[ceph-users] Problematic inode preventing ceph-mds from starting
Last week I asked about a rogue inode that was causing ceph-mds to segfault
during replay. We didn't get any suggestions from this list, so we have been
familiarizing ourselves with the ceph source code, and have added the
following patch:

--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -736,6 +736,13 @@ CDir *CInode::get_approx_dirfrag(frag_t fg)

 CDir *CInode::get_or_open_dirfrag(MDCache *mdcache, frag_t fg)
 {
+  if (!is_dir()) {
+    ostringstream oss;
+    JSONFormatter f(true);
+    f.open_object_section("inode");  // otherwise the output is hard to read
+    dump(&f, DUMP_PATH | DUMP_INODE_STORE_BASE | DUMP_MDS_CACHE_OBJECT |
+         DUMP_LOCKS | DUMP_STATE | DUMP_CAPS | DUMP_DIRFRAGS);
+    f.close_section();
+    f.flush(oss);
+    dout(0) << oss.str() << dendl;
+  }
   ceph_assert(is_dir());

   // have it?

This has given us a culprit:

    -2> 2019-10-18 16:19:06.934 7faefa470700 0 mds.0.cache.ino(0x1995e63)
    "/unimportant/path/we/can/tolerate/losing/compat.py" 10995216789470
    "2018-03-24 03:18:17.621969" "2018-03-24 03:18:17.620969" 3318855521001
    { "dir_hash": 0 }
    { "stripe_unit": 4194304, "stripe_count": 1, "object_size": 4194304,
      "pool_id": 1, "pool_ns": "" }
    []
    3411844674407370955161500 "2015-01-27 16:01:52.467669"
    "2018-03-24 03:18:17.621969" 21-1 []
    { "version": 0, "mtime": "0.00", "num_files": 0, "num_subdirs": 0 }
    { "version": 0, "rbytes": 34, "rfiles": 1, "rsubdirs": 0, "rsnaps": 0, "rctime": "0.00" }
    { "version": 0, "rbytes": 34, "rfiles": 1, "rsubdirs": 0, "rsnaps": 0, "rctime": "0.00" }
    2540123 []
    { "splits": [] }
    true { "replicas": {} }
    { "authority": [ 0, -2 ], "replica_nonce": 0 }
    0 false false {} 0
    { "gather_set": [], "state": "lock", "is_leased": false, "num_rdlocks": 0,
      "num_wrlocks": 0, "num_xlocks": 0, "xlock_by": {} }
    {} {} {} {} {} {} {} {} {}
    [ "auth" ] []
    -1 -1 [] []

    -1> 2019-10-18 16:19:06.964 7faefa470700 -1
    /opt/app-root/src/ceph/src/mds/CInode.cc: In function 'CDir*
    CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7faefa470700 time
    2019-10-18 16:19:06.934662
    /opt/app-root/src/ceph/src/mds/CInode.cc: 746: FAILED ceph_assert(is_dir())

    ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
    1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1aa) [0x7faf0a9ce39e]
    2: (()+0x12a8620) [0x7faf0a9ce620]
    3: (CInode::get_or_open_dirfrag(MDCache*, frag_t)+0x253) [0x557562a4b1ad]
    4: (OpenFileTable::_prefetch_dirfrags()+0x4db) [0x557562b63d63]
    5: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x16a) [0x557562b63720]
    6: (C_OFT_OpenInoFinish::finish(int)+0x2d) [0x557562b67699]
    7: (Context::complete(int)+0x27) [0x557562657fbf]
    8: (MDSContext::complete(int)+0x152) [0x557562b04aa4]
    9: (void finish_contexts > >(CephContext*, std::vector >&, int)+0x2c8) [0x557562660e36]
    10: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0x185) [0x557562844c4d]
    11: (MDCache::_open_ino_backtrace_fetched(inodeno_t, ceph::buffer::v14_2_0::list&, int)+0xbbf) [0x557562842785]
    12: (C_IO_MDC_OpenInoBacktraceFetched::finish(int)+0x37) [0x557562886a31]
    13: (Context::complete(int)+0x27) [0x557562657fbf]
    14: (MDSContext::complete(int)+0x152) [0x557562b04aa4]
    15: (MDSIOContextBase::complete(int)+0x345) [0x557562b0522d]
    16: (Finisher::finisher_thread_entry()+0x38b) [0x7faf0a9033e1]
    17: (Finisher::FinisherThread::entry()+0x1c) [0x5575626a2772]
    18: (Thread::entry_wrapper()+0x78) [0x7faf0a97203c]
    19: (Thread::_entry_func(void*)+0x18) [0x7faf0a971fba]
    20: (()+0x7dd5) [0x7faf07844dd5]
    21: (clone()+0x6d) [0x7faf064f502d]

I tried removing it, but it does not show up in the omapkeys for that inode:

lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 1995e63.
__about__.py_head
__init__.py_head
__pycache___head
_compat.py_head
_structures.py_head
markers.py_head
requirements.py_head
specifiers.py_head
utils.py_head
version.py_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 1995e63. _compat.py_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 1995e63. compat.py_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 1995e63. file-does-not-exist_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 1995e63.
__about__.py_head
__init__.py_head
__pycache___head
_structures.py_head
markers.py_head
requirements.py_head
specifiers.py_head
utils.py_head
version.py_head

Predictably, this did nothing to solve our problem, and ceph-mds is still
dying during startup. Any suggestions?

Neale Pickett
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
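For anyone following along: dirfrag objects in the metadata pool are named
<inode-hex>.<frag-hex>, and each dentry in a directory is an omap key of the
form "<name>_head" on that object. A sketch of the same operations with a full
object name; the 00000000 frag suffix here is an assumption that holds for an
unfragmented directory:

    rados -p cephfs_metadata listomapkeys 1995e63.00000000                # list dentries of dir 0x1995e63, frag 0
    rados -p cephfs_metadata rmomapkey 1995e63.00000000 _compat.py_head   # remove the dentry for "_compat.py"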
Re: [ceph-users] mds servers in endless segfault loop
I have created an anonymized crash log at https://pastebin.ubuntu.com/p/YsVXQQTBCM/
in the hopes that it can help someone understand what's leading to our MDS
outage. Thanks in advance for any assistance.

From: Pickett, Neale T
Sent: Thursday, October 10, 2019 21:46
To: ceph-users@lists.ceph.com
Subject: mds servers in endless segfault loop

Hello, ceph-users.

Our mds servers keep segfaulting from a failed assertion, and for the first
time I can't find anyone else who's posted about this problem. None of them
are able to stay up, so our cephfs is down.

We recently had to truncate the journal log after an upgrade to nautilus, and
now we have lots of "dup inodes", "failed to open inode", and "badness: got
(but i already had)" messages in the recent event dump, if that's relevant.

I don't know which parts of that are going to be the most relevant, but here
are the last ten:

    -10> 2019-10-11 03:30:35.258 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843c err -22/0
     -9> 2019-10-11 03:30:35.260 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843c err -22/0
     -8> 2019-10-11 03:30:35.260 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843d err -22/-22
     -7> 2019-10-11 03:30:35.260 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843e err -22/-22
     -6> 2019-10-11 03:30:35.261 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843f err -22/-22
     -5> 2019-10-11 03:30:35.261 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1845a err -22/-22
     -4> 2019-10-11 03:30:35.262 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1845e err -22/-22
     -3> 2019-10-11 03:30:35.262 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1846f err -22/-22
     -2> 2019-10-11 03:30:35.263 7fd080a69700 0 mds.0.cache failed to open ino 0x1a18470 err -22/-22
     -1> 2019-10-11 03:30:35.273 7fd080a69700 -1
     /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/CInode.cc:
     In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7fd080a69700 time 2019-10-11 03:30:35.273849

I'm happy to provide any other information that would help diagnose the
issue. I don't have any guesses about what else would be helpful, though.
Thanks in advance for any help!

Neale Pickett
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
[ceph-users] mds servers in endless segfault loop
Hello, ceph-users.

Our mds servers keep segfaulting from a failed assertion, and for the first
time I can't find anyone else who's posted about this problem. None of them
are able to stay up, so our cephfs is down.

We recently had to truncate the journal log after an upgrade to nautilus, and
now we have lots of "dup inodes", "failed to open inode", and "badness: got
(but i already had)" messages in the recent event dump, if that's relevant.

I don't know which parts of that are going to be the most relevant, but here
are the last ten:

    -10> 2019-10-11 03:30:35.258 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843c err -22/0
     -9> 2019-10-11 03:30:35.260 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843c err -22/0
     -8> 2019-10-11 03:30:35.260 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843d err -22/-22
     -7> 2019-10-11 03:30:35.260 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843e err -22/-22
     -6> 2019-10-11 03:30:35.261 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1843f err -22/-22
     -5> 2019-10-11 03:30:35.261 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1845a err -22/-22
     -4> 2019-10-11 03:30:35.262 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1845e err -22/-22
     -3> 2019-10-11 03:30:35.262 7fd080a69700 0 mds.0.cache failed to open ino 0x1a1846f err -22/-22
     -2> 2019-10-11 03:30:35.263 7fd080a69700 0 mds.0.cache failed to open ino 0x1a18470 err -22/-22
     -1> 2019-10-11 03:30:35.273 7fd080a69700 -1
     /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/CInode.cc:
     In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7fd080a69700 time 2019-10-11 03:30:35.273849

I'm happy to provide any other information that would help diagnose the
issue. I don't have any guesses about what else would be helpful, though.
Thanks in advance for any help!

Neale Pickett
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
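The "err -22" pairs in those lines are negated errno values. A quick way to
decode them (an illustrative one-liner, not a command from the original
thread):

    python -c 'import errno, os; print(errno.errorcode[22], "=", os.strerror(22))'   # EINVAL = Invalid argument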
Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat
`ceph versions` reports:

    {
        "mon": {
            "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 3
        },
        "mgr": {
            "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 3
        },
        "osd": {
            "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 197,
            "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 11
        },
        "mds": {
            "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
            "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 1
        },
        "overall": {
            "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 199,
            "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 7,
            "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 11
        }
    }

I didn't realize we were in such a weird state with versions, we'll update
all those to 12.2.10 today :)

From: Yan, Zheng
Sent: Tuesday, April 2, 2019 20:26
To: Sergey Malinin
Cc: Pickett, Neale T; ceph-users
Subject: Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat

Looks like http://tracker.ceph.com/issues/37399. Which version of ceph-mds
do you use?

On Tue, Apr 2, 2019 at 7:47 AM Sergey Malinin wrote:
>
> These steps pretty well correspond to
> http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> Were you able to replay journal manually with no issues? IIRC,
> "cephfs-journal-tool recover_dentries" would lead to OOM in case of MDS
> doing so, and it has already been discussed on this list.
>
> April 2, 2019 1:37 AM, "Pickett, Neale T" wrote:
>
> Here is what I wound up doing to fix this:
>
> * Bring down all MDSes so they stop flapping
> * Back up journal (as seen in previous message)
> * Apply journal manually
> * Reset journal manually
> * Clear session table
> * Clear other tables (not sure I needed to do this)
> * Mark FS down
> * Mark the rank 0 MDS as failed
> * Reset the FS (yes, I really mean it)
> * Restart MDSes
> * Finally get some sleep
Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat
Since my problem is going to be archived on the Internet I'll keep following
up, so the next person with this problem might save some time.

The seek failure was because ext4 can't seek to 23TB; switching to an xfs
mount to create this file resulted in success.

Here is what I wound up doing to fix this (the command-level equivalents are
sketched after this message):

 * Bring down all MDSes so they stop flapping
 * Back up journal (as seen in previous message)
 * Apply journal manually
 * Reset journal manually
 * Clear session table
 * Clear other tables (not sure I needed to do this)
 * Mark FS down
 * Mark the rank 0 MDS as failed
 * Reset the FS (yes, I really mean it)
 * Restart MDSes
 * Finally get some sleep

If anybody has any idea what may have caused this situation, I am keenly
interested. If not, hopefully I at least helped someone else.

From: Pickett, Neale T
Sent: Monday, April 1, 2019 12:31
To: ceph-users@lists.ceph.com
Subject: Re: MDS allocates all memory (>500G) replaying, OOM-killed, repeat

We decided to go ahead and try truncating the journal, but before we did, we
would try to back it up. However, there are ridiculous values in the header.
It can't write a journal this large because (I presume) my ext4 filesystem
can't seek to this position in the (sparse) file. I would not be surprised to
learn that memory allocation is trying to do something similar, hence the
allocation of all available memory. This seems like a new kind of journal
corruption that isn't being reported correctly.

[root@lima /]# time cephfs-journal-tool --cluster=prodstore journal export backup.bin
journal is 24652730602129~673601102
2019-04-01 17:49:52.776977 7fdcb999e040 -1 Error 22 ((22) Invalid argument) seeking to 0x166be9401291
Error ((22) Invalid argument)

real    0m27.832s
user    0m2.028s
sys     0m3.438s

[root@lima /]# cephfs-journal-tool --cluster=prodstore event get summary
Events by type:
  EXPORT: 187
  IMPORTFINISH: 182
  IMPORTSTART: 182
  OPEN: 3133
  SUBTREEMAP: 129
  UPDATE: 42185
Errors: 0

[root@lima /]# cephfs-journal-tool --cluster=prodstore header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 24653404029749,
    "expire_pos": 24652730602129,
    "trimmed_pos": 24652730597376,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 2,
        "pool_ns": ""
    }
}

[root@lima /]# printf "%x\n" "24653404029749"
166c1163c335
[root@lima /]# printf "%x\n" "24652730602129"
166be9401291
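A command-level sketch of the steps above, following the disaster-recovery
documentation of that era (luminous-style invocations without --rank). The
filesystem name "prodstore" is borrowed from the --cluster flag in the quoted
output and may not match the actual fs name; all MDS daemons must be stopped
first:

    cephfs-journal-tool journal export backup.bin        # back up journal
    cephfs-journal-tool event recover_dentries summary   # apply journal manually
    cephfs-journal-tool journal reset                    # reset journal manually
    cephfs-table-tool all reset session                  # clear session table
    cephfs-table-tool all reset snap                     # clear other tables
    cephfs-table-tool all reset inode
    ceph fs set prodstore cluster_down true              # mark FS down
    ceph mds fail 0                                      # mark the rank 0 MDS as failed
    ceph fs reset prodstore --yes-i-really-mean-it       # reset the FS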
Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat
We decided to go ahead and try truncating the journal, but before we did, we
would try to back it up. However, there are ridiculous values in the header.
It can't write a journal this large because (I presume) my ext4 filesystem
can't seek to this position in the (sparse) file. I would not be surprised to
learn that memory allocation is trying to do something similar, hence the
allocation of all available memory. This seems like a new kind of journal
corruption that isn't being reported correctly.

[root@lima /]# time cephfs-journal-tool --cluster=prodstore journal export backup.bin
journal is 24652730602129~673601102
2019-04-01 17:49:52.776977 7fdcb999e040 -1 Error 22 ((22) Invalid argument) seeking to 0x166be9401291
Error ((22) Invalid argument)

real    0m27.832s
user    0m2.028s
sys     0m3.438s

[root@lima /]# cephfs-journal-tool --cluster=prodstore event get summary
Events by type:
  EXPORT: 187
  IMPORTFINISH: 182
  IMPORTSTART: 182
  OPEN: 3133
  SUBTREEMAP: 129
  UPDATE: 42185
Errors: 0

[root@lima /]# cephfs-journal-tool --cluster=prodstore header get
{
    "magic": "ceph fs volume v011",
    "write_pos": 24653404029749,
    "expire_pos": 24652730602129,
    "trimmed_pos": 24652730597376,
    "stream_format": 1,
    "layout": {
        "stripe_unit": 4194304,
        "stripe_count": 1,
        "object_size": 4194304,
        "pool_id": 2,
        "pool_ns": ""
    }
}

[root@lima /]# printf "%x\n" "24653404029749"
166c1163c335
[root@lima /]# printf "%x\n" "24652730602129"
166be9401291
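A quick check of the arithmetic behind the ext4 explanation (the 16 TiB
figure assumes the common 4 KiB block size):

    python -c 'print(24653404029749 / 2**40)'   # write_pos  ~= 22.4 TiB
    python -c 'print(24652730602129 / 2**40)'   # expire_pos ~= 22.4 TiB, i.e. 0x166be9401291 above

ext4 caps file size at 16 TiB with 4 KiB blocks, so seeking the sparse export
file to expire_pos fails with EINVAL; xfs, whose limit is 8 EiB, takes it
without complaint, as the xfs workaround in the follow-up confirms.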
[ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat
Hello

We are experiencing an issue where our ceph MDS gobbles up 500G of RAM, is
killed by the kernel, dies, then repeats. We have 3 MDS daemons on different
machines, and all are exhibiting this behavior. We are running the following
versions (from Docker):

 * ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
 * ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
 * ceph/daemon:v3.1.0-stable-3.1-luminous-centos-7 (downgraded in a last-ditch effort to resolve; didn't help)

The machines hosting the MDS instances have 512G RAM. We tried adding swap,
and the MDS just started eating into the swap (and got really slow, eventually
being kicked out for exceeding the mds_beacon_grace of 240).

mds_cache_memory_limit has been set to many values ranging from 200G down to
the default of 1073741824, and the result of replay is always the same: keep
allocating memory until the kernel OOM killer stops it (or the
mds_beacon_grace period expires, if swap is enabled).

Before it died, the active MDS reported 1.592 million inodes to Prometheus
(ceph_mds_inodes) and 1.493 million caps (ceph_mds_caps).

This appears to be the same problem as
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030872.html

At this point I feel like my best option is to try to destroy the journal and
hope things come back, but while we can probably recover from this, I'd like
to prevent it happening in the future. Any advice?

Neale Pickett
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
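For reference, the two knobs mentioned above live in ceph.conf. A sketch with
the values described in this message; these are illustrative, not
recommendations:

    [mds]
    mds_beacon_grace = 240                  # seconds before the mons give up on a laggy MDS
    mds_cache_memory_limit = 214748364800   # 200G, one of the values tried; the default is 1073741824 (1 GiB)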