Re: [ceph-users] Problematic inode preventing ceph-mds from starting

2019-10-28 Thread Pickett, Neale T
Hi!


Yes, resetting journals is exactly what we did, quite a while ago, when the mds 
ran out of memory because a journal entry contained an absurdly large number (I 
think it may have been an inode number). We probably also reset the inode table 
later, which I have since learned resets an on-disk data structure, and which 
probably started us overwriting inodes, dentries, or both.
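
(For anyone finding this thread later: the earlier recovery I am describing maps 
to roughly the following commands. Treat this as a sketch; the exact flags 
differ between releases, and the rank/fs name here are placeholders.)

cephfs-journal-tool --rank=cephfs:0 journal reset   # the journal reset (older releases omit --rank)
cephfs-table-tool all reset inode                   # the inode table reset mentioned above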


So I take it (we are learning about filesystems very quickly over here) that 
ceph has been reusing inode numbers, and that re-scanning (scan_links) will 
somehow figure out which dentry is most recent and drop the older (now wrong) 
one. Presumably it can also handle hard links (we have few, if any, of those).


Thanks very much for your help. This has been fascinating.


Neale





From: Patrick Donnelly 
Sent: Monday, October 28, 2019 12:52
To: Pickett, Neale T
Cc: ceph-users
Subject: Re: [ceph-users] Problematic inode preventing ceph-mds from starting

On Fri, Oct 25, 2019 at 12:11 PM Pickett, Neale T  wrote:
> In the last week we have made a few changes to the down filesystem in an 
> attempt to fix what we thought was an inode problem:
>
>
> cephfs-data-scan scan_extents   # about 1 day with 64 processes
>
> cephfs-data-scan scan_inodes   # about 1 day with 64 processes
>
> cephfs-data-scan scan_links   # about 1 day

Did you reset the journals or perform any other disaster recovery
commands? This process likely introduced the duplicate inodes.

> After these three, we tried to start an MDS and it stayed up. We then ran:
>
> ceph tell mds.a scrub start / recursive repair
>
>
> The repair ran about 3 days, spewing logs to `ceph -w` about duplicated 
> inodes, until it stopped. All looked well until we began bringing production 
> services back online, at which point many error messages appeared, the mds 
> went back into damaged, and the fs back to degraded. At this point I removed 
> the objects you suggested, which brought everything back briefly.
>
> The latest crash is:
>
> -1> 2019-10-25 18:47:50.731 7fc1f3b56700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>  In function 'void MDCache::add_inode(CInode*)' thread 7fc1f3b56700 time 
> 2019-1...
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.2/rpm/el7/BUILD/ceph-14.2.2/src/mds/MDCache.cc:
>  258: FAILED ceph_assert(!p)

This error indicates a duplicate inode loaded into cache. Fixing this
probably requires significant intervention and (meta)data loss for
recent changes:

- Stop/unmount all clients. (Probably already the case if the rank is damaged!)

- Reset the MDS journal [1] and optionally recover any dentries first.
(This will hopefully resolve the ESubtreeMap errors you pasted.) Note
that some metadata may be lost through this command.

- `cephfs-data-scan scan_links` again. This should repair any
duplicate inodes (by dropping the older dentries).

- Then you can try marking the rank as repaired (rough command sketch below).
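
For reference, in rough command form (rank 0 of a filesystem named "cephfs" is
assumed; check the disaster recovery documentation [1] for your release before
running any of this):

cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary   # optionally salvage dentries first
cephfs-journal-tool --rank=cephfs:0 journal reset                    # discards unflushed metadata
cephfs-data-scan scan_links                                          # drops the older of any duplicate dentries
ceph mds repaired cephfs:0                                           # then try marking the rank repaired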

Good luck!

[1] 
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/#journal-truncation


--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D



Re: [ceph-users] Problematic inode preventing ceph-mds from starting

2019-10-25 Thread Pickett, Neale T
mds.frigga [ERR]  replayed ESubtreeMap at 
30403641312085 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.728678 mds.frigga [ERR]  replayed ESubtreeMap at 
30403641478175 subtree root 0x1 is not mine in cache (it's -2,-2)
2019-10-25 18:48:05.734157 mds.frigga [ERR] failure replaying journal 
(EMetaBlob)
2019-10-25 18:48:06.676985 mon.coffee [ERR] Health check failed: 1 filesystem 
is offline (MDS_ALL_DOWN)
2019-10-25 18:48:06.677063 mon.coffee [ERR] Health check failed: 1 mds daemon 
damaged (MDS_DAMAGE)

We now show

mds: cephfs:0/1 5 up:standby, 1 damaged

(we only have 5 mds servers)

This still smells like an inode problem to me, but I have completely run out of 
ideas, so I will do nothing more to ceph while I await a reply from the list and 
anxiously hope I am not fired for this 14-days-and-counting outage.

Thank you very much!

Neale


From: Patrick Donnelly 
Sent: Thursday, October 24, 2019 17:57
To: Pickett, Neale T
Cc: ceph-users
Subject: Re: [ceph-users] Problematic inode preventing ceph-mds from starting

Hi Neale,

On Fri, Oct 18, 2019 at 9:31 AM Pickett, Neale T  wrote:
>
> Last week I asked about a rogue inode that was causing ceph-mds to segfault 
> during replay. We didn't get any suggestions from this list, so we have been 
> familiarizing ourselves with the ceph source code, and have added the 
> following patch:
>
>
>
> --- a/src/mds/CInode.cc
> +++ b/src/mds/CInode.cc
> @@ -736,6 +736,13 @@ CDir *CInode::get_approx_dirfrag(frag_t fg)
>
>  CDir *CInode::get_or_open_dirfrag(MDCache *mdcache, frag_t fg)
>  {
> +  if (!is_dir()) {
> +ostringstream oss;
> +JSONFormatter f(true);

f.open_object_section("inode");

(otherwise the output is hard to read)

> +    dump(&f, DUMP_PATH | DUMP_INODE_STORE_BASE | DUMP_MDS_CACHE_OBJECT | DUMP_LOCKS | DUMP_STATE | DUMP_CAPS | DUMP_DIRFRAGS);

f.close_object_section("inode")

> +f.flush(oss);
> +dout(0) << oss.str() << dendl;
> +  }
>ceph_assert(is_dir());
>
>// have it?

This seems like a generally useful patch. Feel free to submit a PR.

>
> This has given us a culprit:
>
>
>
> -2> 2019-10-18 16:19:06.934 7faefa470700  0 
> mds.0.cache.ino(0x1995e63) 
> "/unimportant/path/we/can/tolerate/losing/compat.py"10995216789470"2018-03-24 
> 03:18:17.621969""2018-03-24 03:18:17.620969"3318855521001{
> "dir_hash": 0
> }
> {
> "stripe_unit": 4194304,
> "stripe_count": 1,
> "object_size": 4194304,
> "pool_id": 1,
> "pool_ns": ""
> }
> []
> 3411844674407370955161500"2015-01-27 16:01:52.467669""2018-03-24 
> 03:18:17.621969"21-1[]
> {
> "version": 0,
> "mtime": "0.00",
> "num_files": 0,
> "num_subdirs": 0
> }
> {
> "version": 0,
> "rbytes": 34,
> "rfiles": 1,
> "rsubdirs": 0,
> "rsnaps": 0,
> "rctime": "0.00"
> }
> {
> "version": 0,
> "rbytes": 34,
> "rfiles": 1,
> "rsubdirs": 0,
> "rsnaps": 0,
> "rctime": "0.00"
> }
> 2540123""""[]
> {
> "splits": []
> }
> true{
> "replicas": {}
> }
> {
> "authority": [
> 0,
> -2
> ],
> "replica_nonce": 0
> }
> 0falsefalse{}
> 0{
> "gather_set": [],
> "state": "lock",
> "is_leased": false,
> "num_rdlocks": 0,
> "num_wrlocks": 0,
> "num_xlocks": 0,
> "xlock_by": {}
> }
> {}
> {}
> {}
> {}
> {}
> {}
> {}
> {}
> {}
> [
> "auth"
> ]
> []
> -1-1[]
> []
>
> -1> 2019-10-18 16:19:06.964 7faefa470700 -1 
> /opt/app-root/src/ceph/src/mds/CInode.cc: In function 'CDir* 
> CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7faefa470700 time 
> 2019-10-18 16:19:06.934662
> /opt/app-root/src/ceph/src/mds/CInode.cc: 746: FAILED ceph_assert(is_dir())
>
>  ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x1aa) [0x7faf0a9ce39e]
>  2: (()+0x12a8620) [0x7faf0a9ce620]
>  3: (CInode::get_or_open_dirfrag(MDCache*, frag_t)+0x253) [0x557562a4b1ad]
>  4: (OpenFileTable::_prefetch_dirfrags()+0x4db) [0x557562b63d63]
>  5: (OpenFileTabl

[ceph-users] Problematic inode preventing ceph-mds from starting

2019-10-18 Thread Pickett, Neale T
Last week I asked about a rogue inode that was causing ceph-mds to segfault 
during replay. We didn't get any suggestions from this list, so we have been 
familiarizing ourselves with the ceph source code, and have added the following 
patch:



--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -736,6 +736,13 @@ CDir *CInode::get_approx_dirfrag(frag_t fg)

 CDir *CInode::get_or_open_dirfrag(MDCache *mdcache, frag_t fg)
 {
+  if (!is_dir()) {
+    ostringstream oss;
+    JSONFormatter f(true);
+    dump(&f, DUMP_PATH | DUMP_INODE_STORE_BASE | DUMP_MDS_CACHE_OBJECT | DUMP_LOCKS | DUMP_STATE | DUMP_CAPS | DUMP_DIRFRAGS);
+    f.flush(oss);
+    dout(0) << oss.str() << dendl;
+  }
   ceph_assert(is_dir());

   // have it?


This has given us a culprit:



-2> 2019-10-18 16:19:06.934 7faefa470700  0 mds.0.cache.ino(0x1995e63) 
"/unimportant/path/we/can/tolerate/losing/compat.py"10995216789470"2018-03-24 
03:18:17.621969""2018-03-24 03:18:17.620969"3318855521001{
"dir_hash": 0
}
{
"stripe_unit": 4194304,
"stripe_count": 1,
"object_size": 4194304,
"pool_id": 1,
"pool_ns": ""
}
[]
3411844674407370955161500"2015-01-27 16:01:52.467669""2018-03-24 
03:18:17.621969"21-1[]
{
"version": 0,
"mtime": "0.00",
"num_files": 0,
"num_subdirs": 0
}
{
"version": 0,
"rbytes": 34,
"rfiles": 1,
"rsubdirs": 0,
"rsnaps": 0,
"rctime": "0.00"
}
{
"version": 0,
"rbytes": 34,
"rfiles": 1,
"rsubdirs": 0,
"rsnaps": 0,
"rctime": "0.00"
}
2540123[]
{
"splits": []
}
true{
"replicas": {}
}
{
"authority": [
0,
-2
],
"replica_nonce": 0
}
0falsefalse{}
0{
"gather_set": [],
"state": "lock",
"is_leased": false,
"num_rdlocks": 0,
"num_wrlocks": 0,
"num_xlocks": 0,
"xlock_by": {}
}
{}
{}
{}
{}
{}
{}
{}
{}
{}
[
"auth"
]
[]
-1-1[]
[]

-1> 2019-10-18 16:19:06.964 7faefa470700 -1 
/opt/app-root/src/ceph/src/mds/CInode.cc: In function 'CDir* 
CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 7faefa470700 time 
2019-10-18 16:19:06.934662
/opt/app-root/src/ceph/src/mds/CInode.cc: 746: FAILED ceph_assert(is_dir())

 ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1aa) [0x7faf0a9ce39e]
 2: (()+0x12a8620) [0x7faf0a9ce620]
 3: (CInode::get_or_open_dirfrag(MDCache*, frag_t)+0x253) [0x557562a4b1ad]
 4: (OpenFileTable::_prefetch_dirfrags()+0x4db) [0x557562b63d63]
 5: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x16a) [0x557562b63720]
 6: (C_OFT_OpenInoFinish::finish(int)+0x2d) [0x557562b67699]
 7: (Context::complete(int)+0x27) [0x557562657fbf]
 8: (MDSContext::complete(int)+0x152) [0x557562b04aa4]
 9: (void finish_contexts 
> >(CephContext*, std::vector >&, 
int)+0x2c8) [0x557562660e36]
 10: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, 
int)+0x185) [0x557562844c4d]
 11: (MDCache::_open_ino_backtrace_fetched(inodeno_t, 
ceph::buffer::v14_2_0::list&, int)+0xbbf) [0x557562842785]
 12: (C_IO_MDC_OpenInoBacktraceFetched::finish(int)+0x37) [0x557562886a31]
 13: (Context::complete(int)+0x27) [0x557562657fbf]
 14: (MDSContext::complete(int)+0x152) [0x557562b04aa4]
 15: (MDSIOContextBase::complete(int)+0x345) [0x557562b0522d]
 16: (Finisher::finisher_thread_entry()+0x38b) [0x7faf0a9033e1]
 17: (Finisher::FinisherThread::entry()+0x1c) [0x5575626a2772]
 18: (Thread::entry_wrapper()+0x78) [0x7faf0a97203c]
 19: (Thread::_entry_func(void*)+0x18) [0x7faf0a971fba]
 20: (()+0x7dd5) [0x7faf07844dd5]
 21: (clone()+0x6d) [0x7faf064f502d]


I tried removing it, but it does not show up in the omapkeys for that inode:

lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 
1995e63.
__about__.py_head
__init__.py_head
__pycache___head
_compat.py_head
_structures.py_head
markers.py_head
requirements.py_head
specifiers.py_head
utils.py_head
version.py_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 
1995e63. _compat.py_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 
1995e63. compat.py_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata rmomapkey 
1995e63. file-does-not-exist_head
lima:/home/neale$ ceph -- rados -p cephfs_metadata listomapkeys 
1995e63.
__about__.py_head
__init__.py_head
__pycache___head
_structures.py_head
markers.py_head
requirements.py_head
specifiers.py_head
utils.py_head
version.py_head

Predictably, this did nothing to solve our problem, and ceph-mds is still dying 
during startup.
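
(Note for anyone retracing this: the object above is the directory's dirfrag
object, named <inode hex>.<fragment hex>; the listing truncates the name. A
hedged example of inspecting a dentry before removing it, assuming the usual
.00000000 suffix of an unfragmented directory:)

rados -p cephfs_metadata listomapvals 1995e63.00000000                 # dump dentry keys and values
rados -p cephfs_metadata getomapval 1995e63.00000000 _compat.py_head   # show a single dentry's value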

Any suggestions?


Neale Pickett 
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory


Re: [ceph-users] mds servers in endless segfault loop

2019-10-11 Thread Pickett, Neale T
I have created an anonymized crash log at 
https://pastebin.ubuntu.com/p/YsVXQQTBCM/ in the hopes that it can help someone 
understand what's leading to our MDS outage.


Thanks in advance for any assistance.



From: Pickett, Neale T
Sent: Thursday, October 10, 2019 21:46
To: ceph-users@lists.ceph.com
Subject: mds servers in endless segfault loop


Hello, ceph-users.


Our mds servers keep segfaulting from a failed assertion, and for the first 
time I can't find anyone else who's posted about this problem. None of them are 
able to stay up, so our cephfs is down.


We recently had to truncate the MDS journal after an upgrade to nautilus, and 
now we have lots of "dup inode", "failed to open ino", and "badness: got (but i 
already had)" messages in the recent event dump, if that's relevant. I don't 
know which parts of that are going to be most relevant, but here are the last 
ten entries:


  -10> 2019-10-11 03:30:35.258 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843c err -22/0
-9> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843c err -22/0
-8> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843d err -22/-22
-7> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843e err -22/-22
-6> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843f err -22/-22
-5> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1845a err -22/-22
-4> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1845e err -22/-22
-3> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1846f err -22/-22
-2> 2019-10-11 03:30:35.263 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a18470 err -22/-22
-1> 2019-10-11 03:30:35.273 7fd080a69700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/CInode.cc:
 In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 
7fd080a69700 time 2019-10-11 03:30:35.273849


I'm happy to provide any other information that would help diagnose the issue. 
I don't have any guesses about what else would be helpful, though.


Thanks in advance for any help!



Neale Pickett 
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory


[ceph-users] mds servers in endless segfault loop

2019-10-10 Thread Pickett, Neale T
Hello, ceph-users.


Our mds servers keep segfaulting from a failed assertion, and for the first 
time I can't find anyone else who's posted about this problem. None of them are 
able to stay up, so our cephfs is down.


We recently had to truncate the MDS journal after an upgrade to nautilus, and 
now we have lots of "dup inode", "failed to open ino", and "badness: got (but i 
already had)" messages in the recent event dump, if that's relevant. I don't 
know which parts of that are going to be most relevant, but here are the last 
ten entries:


  -10> 2019-10-11 03:30:35.258 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843c err -22/0
-9> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843c err -22/0
-8> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843d err -22/-22
-7> 2019-10-11 03:30:35.260 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843e err -22/-22
-6> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1843f err -22/-22
-5> 2019-10-11 03:30:35.261 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1845a err -22/-22
-4> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1845e err -22/-22
-3> 2019-10-11 03:30:35.262 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a1846f err -22/-22
-2> 2019-10-11 03:30:35.263 7fd080a69700  0 mds.0.cache  failed to open ino 
0x1a18470 err -22/-22
-1> 2019-10-11 03:30:35.273 7fd080a69700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/src/mds/CInode.cc:
 In function 'CDir* CInode::get_or_open_dirfrag(MDCache*, frag_t)' thread 
7fd080a69700 time 2019-10-11 03:30:35.273849


I'm happy to provide any other information that would help diagnose the issue. 
I don't have any guesses about what else would be helpful, though.


Thanks in advance for any help!



Neale Pickett 
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory


Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat

2019-04-03 Thread Pickett, Neale T
`ceph versions` reports:


{
"mon": {
"ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 3
},
"mgr": {
"ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 3
},
"osd": {
"ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) 
luminous (stable)": 197,
"ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
luminous (stable)": 11
},
"mds": {
"ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) 
luminous (stable)": 2,
"ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 1
},
"overall": {
"ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) 
luminous (stable)": 199,
"ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 7,
"ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) 
luminous (stable)": 11
}
}



I didn't realize we were in such a weird state with versions, we'll update all 
those to 12.2.10 today :)



From: Yan, Zheng 
Sent: Tuesday, April 2, 2019 20:26
To: Sergey Malinin
Cc: Pickett, Neale T; ceph-users
Subject: Re: [ceph-users] MDS allocates all memory (>500G) replaying, 
OOM-killed, repeat

Looks like http://tracker.ceph.com/issues/37399. which version of
ceph-mds do you use?

On Tue, Apr 2, 2019 at 7:47 AM Sergey Malinin  wrote:
>
> These steps pretty well correspond to 
> http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> Were you able to replay journal manually with no issues? IIRC, 
> "cephfs-journal-tool recover_dentries" would lead to OOM in case of MDS doing 
> so, and it has already been discussed on this list.
>
>
> April 2, 2019 1:37 AM, "Pickett, Neale T"  wrote:
>
> Here is what I wound up doing to fix this:
>
> Bring down all MDSes so they stop flapping
> Back up journal (as seen in previous message)
> Apply journal manually
> Reset journal manually
> Clear session table
> Clear other tables (not sure I needed to do this)
> Mark FS down
> Mark the rank 0 MDS as failed
> Reset the FS (yes, I really mean it)
> Restart MDSes
> Finally get some sleep
>


Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat

2019-04-01 Thread Pickett, Neale T
Since my problem is going to be archived on the Internet I'll keep following 
up, so the next person with this problem might save some time.


The seek failed because ext4 can't handle a file offset around 23TB; switching 
to an xfs mount to create the backup file resulted in success.
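
For the record, the arithmetic: the expire_pos in the journal header (quoted 
below) is the offset the export seeks to, and it is past ext4's 16 TiB 
per-file limit (with 4 KiB blocks):

echo $(( 24652730602129 / 1024**4 ))   # bash: prints 22 (TiB), i.e. larger than 16 TiB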


Here is what I wound up doing to fix this (rough command equivalents below the list):


  *   Bring down all MDSes so they stop flapping
  *   Back up journal (as seen in previous message)
  *   Apply journal manually
  *   Reset journal manually
  *   Clear session table
  *   Clear other tables (not sure I needed to do this)
  *   Mark FS down
  *   Mark the rank 0 MDS as failed
  *   Reset the FS (yes, I really mean it)
  *   Restart MDSes
  *   Finally get some sleep
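
In rough command form (the fs name is a placeholder; this mirrors the documented 
disaster-recovery steps rather than an exact shell transcript):

cephfs-journal-tool journal export backup.bin         # back up journal
cephfs-journal-tool event recover_dentries summary    # apply journal manually
cephfs-journal-tool journal reset                     # reset journal manually
cephfs-table-tool all reset session                   # clear session table
cephfs-table-tool all reset snap                      # clear other tables
cephfs-table-tool all reset inode
ceph fs set <fs_name> cluster_down true               # mark FS down (luminous syntax)
ceph mds fail 0                                       # mark the rank 0 MDS as failed
ceph fs reset <fs_name> --yes-i-really-mean-it        # reset the FS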

If anybody has any idea what may have caused this situation, I am keenly 
interested. If not, hopefully I at least helped someone else.



From: Pickett, Neale T
Sent: Monday, April 1, 2019 12:31
To: ceph-users@lists.ceph.com
Subject: Re: MDS allocates all memory (>500G) replaying, OOM-killed, repeat


We decided to go ahead and try truncating the journal, but before we did, we 
would try to back it up. However, there are ridiculous values in the header. It 
can't write a journal this large because (I presume) my ext4 filesystem can't 
seek to this position in the (sparse) file.


I would not be surprised to learn that memory allocation is trying to do 
something similar, hence the allocation of all available memory. This seems 
like a new kind of journal corruption that isn't being reported correctly.

[root@lima /]# time cephfs-journal-tool --cluster=prodstore journal export 
backup.bin
journal is 24652730602129~673601102
2019-04-01 17:49:52.776977 7fdcb999e040 -1 Error 22 ((22) Invalid argument) 
seeking to 0x166be9401291
Error ((22) Invalid argument)

real0m27.832s
user0m2.028s
sys 0m3.438s
[root@lima /]# cephfs-journal-tool --cluster=prodstore event get summary
Events by type:
  EXPORT: 187
  IMPORTFINISH: 182
  IMPORTSTART: 182
  OPEN: 3133
  SUBTREEMAP: 129
  UPDATE: 42185
Errors: 0
[root@lima /]# cephfs-journal-tool --cluster=prodstore header get
{
"magic": "ceph fs volume v011",
"write_pos": 24653404029749,
"expire_pos": 24652730602129,
"trimmed_pos": 24652730597376,
"stream_format": 1,
"layout": {
"stripe_unit": 4194304,
"stripe_count": 1,
"object_size": 4194304,
"pool_id": 2,
"pool_ns": ""
}
}

[root@lima /]# printf "%x\n" "24653404029749"
166c1163c335
[root@lima /]# printf "%x\n" "24652730602129"
166be9401291



Re: [ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat

2019-04-01 Thread Pickett, Neale T
We decided to go ahead and try truncating the journal, but before we did, we 
would try to back it up. However, there are ridiculous values in the header. It 
can't write a journal this large because (I presume) my ext4 filesystem can't 
seek to this position in the (sparse) file.


I would not be surprised to learn that memory allocation is trying to do 
something similar, hence the allocation of all available memory. This seems 
like a new kind of journal corruption that isn't being reported correctly.

[root@lima /]# time cephfs-journal-tool --cluster=prodstore journal export 
backup.bin
journal is 24652730602129~673601102
2019-04-01 17:49:52.776977 7fdcb999e040 -1 Error 22 ((22) Invalid argument) 
seeking to 0x166be9401291
Error ((22) Invalid argument)

real0m27.832s
user0m2.028s
sys 0m3.438s
[root@lima /]# cephfs-journal-tool --cluster=prodstore event get summary
Events by type:
  EXPORT: 187
  IMPORTFINISH: 182
  IMPORTSTART: 182
  OPEN: 3133
  SUBTREEMAP: 129
  UPDATE: 42185
Errors: 0
[root@lima /]# cephfs-journal-tool --cluster=prodstore header get
{
"magic": "ceph fs volume v011",
"write_pos": 24653404029749,
"expire_pos": 24652730602129,
"trimmed_pos": 24652730597376,
"stream_format": 1,
"layout": {
"stripe_unit": 4194304,
"stripe_count": 1,
"object_size": 4194304,
"pool_id": 2,
"pool_ns": ""
}
}

[root@lima /]# printf "%x\n" "24653404029749"
166c1163c335
[root@lima /]# printf "%x\n" "24652730602129"
166be9401291



[ceph-users] MDS allocates all memory (>500G) replaying, OOM-killed, repeat

2019-04-01 Thread Pickett, Neale T
Hello


We are experiencing an issue where our ceph MDS gobbles up 500G of RAM during 
replay, is OOM-killed by the kernel, restarts, and then repeats the cycle. We 
have 3 MDS daemons on different machines, and all are exhibiting this behavior. 
We are running the following versions (from Docker):


  *   ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
  *   ceph/daemon:v3.2.1-stable-3.2-luminous-centos-7
  *   ceph/daemon:v3.1.0-stable-3.1-luminous-centos-7 (downgraded in last-ditch 
effort to resolve, didn't help)

The machines hosting the MDS instances have 512G of RAM. We tried adding swap, 
and the MDS just started eating into the swap (and got really slow, eventually 
being kicked out for exceeding the mds_beacon_grace of 240). 
mds_cache_memory_limit has been set to many values ranging from 200G down to the 
default of 1073741824 bytes, and the result of replay is always the same: the 
MDS keeps allocating memory until the kernel OOM killer stops it (or until the 
mds_beacon_grace period expires, if swap is enabled).
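
For reference, we set the limit roughly like this (value in bytes; 200G shown as 
an example, and the exact mechanism may have been ceph.conf instead):

ceph tell mds.* injectargs '--mds_cache_memory_limit=214748364800'   # 200 GiB at runtime
# or in ceph.conf under [mds]:  mds cache memory limit = 214748364800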

Before it died, the active MDS reported 1.592 million inodes to Prometheus 
(ceph_mds_inodes) and 1.493 million caps (ceph_mds_caps).

This appears to be the same problem as 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030872.html

At this point I feel like my best option is to try to destroy the journal and 
hope things come back, but while we can probably recover from this, I'd like to 
prevent it happening in the future. Any advice?


Neale Pickett 
A-4: Advanced Research in Cyber Systems
Los Alamos National Laboratory
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com