Hi,
I was deleting a lot of hard-linked files when "something" happened.
Now my MDS starts, runs for a few seconds, writes a lot of these lines:
-43> 2017-09-06 13:51:43.396588 7f9047b21700 10 log_client will send 2017-09-06 13:51:40.531563 mds.0 10.210.32.12:6802/2735447218 4963 : cluster [ERR] loaded dup inode 100007d6511 [2,head] v17234443 at ~mds0/stray8/100007d6511, but inode 100007d6511.head v17500983 already exists at ~mds0/stray7/100007d6511
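The stray directories named in the error (~mds0/stray0 through ~mds0/stray9) are, as far as I understand, inodes 0x600 through 0x609, stored as omap entries on the objects 600.00000000 through 609.00000000 in the metadata pool, so the duplicate dentry can be inspected directly with python-rados. A rough sketch (the pool name 'metadata' is an assumption, adjust for your cluster):

# Sketch: list mds0's stray dirfrags and look for the duplicated dentry.
# Dentries are stored as omap keys of the form '<inode-hex>_head'.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('metadata')   # assumed metadata pool name

for ino in range(0x600, 0x60a):          # stray0 .. stray9 of rank 0
    obj = '%x.00000000' % ino
    try:
        with rados.ReadOpCtx() as op:
            # fetch up to 1000 omap entries; larger dirfrags would need paging
            it, rc = ioctx.get_omap_vals(op, "", "", 1000)
            ioctx.operate_read_op(op, obj)
            for key, _ in it:
                if key.startswith('100007d6511'):
                    print('%s: %s' % (obj, key))
    except rados.ObjectNotFound:
        pass

ioctx.close()
cluster.shutdown()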
And finally this:
-3> 2017-09-06 13:51:43.396762 7f9047b21700 10 monclient: _send_mon_message to mon.2 at 10.210.34.11:6789/0
-2> 2017-09-06 13:51:43.396770 7f9047b21700 1 -- 10.210.32.12:6802/2735447218 --> 10.210.34.11:6789/0 -- log(1000 entries from seq 4003 at 2017-09-06 13:51:38.718139) v1 -- ?+0 0x7f905c5d5d40 con 0x7f905902c600
-1> 2017-09-06 13:51:43.399561 7f9047b21700 1 -- 10.210.32.12:6802/2735447218 <== mon.2 10.210.34.11:6789/0 26 ==== mdsbeacon(152160002/0 up:active seq 8 v47532) v7 ==== 126+0+0 (20071477 0 0) 0x7f90591b2080 con 0x7f905902c600
0> 2017-09-06 13:51:43.401125 7f9043b19700 -1 *** Caught signal (Aborted) **
in thread 7f9043b19700 thread_name:mds_rank_progr
ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
1: (()+0x5087b7) [0x7f904ed547b7]
2: (()+0xf890) [0x7f904e156890]
3: (gsignal()+0x37) [0x7f904c5e1067]
4: (abort()+0x148) [0x7f904c5e2448]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f904ee5e386]
6: (StrayManager::eval_remote_stray(CDentry*, CDentry*)+0x492) [0x7f904ebaad12]
7: (StrayManager::__eval_stray(CDentry*, bool)+0x5f5) [0x7f904ebaefd5]
8: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f904ebaf7ae]
9: (MDCache::scan_stray_dir(dirfrag_t)+0x165) [0x7f904eb04145]
10: (MDCache::populate_mydir()+0x7fc) [0x7f904eb73acc]
11: (MDCache::open_root()+0xef) [0x7f904eb7447f]
12: (MDSInternalContextBase::complete(int)+0x203) [0x7f904ecad5c3]
13: (MDSRank::_advance_queues()+0x382) [0x7f904ea689e2]
14: (MDSRank::ProgressThread::entry()+0x4a) [0x7f904ea68e6a]
15: (()+0x8064) [0x7f904e14f064]
16: (clone()+0x6d) [0x7f904c69462d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
99/99 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mds.0.log
--- end dump of recent events ---
Looking at daemonperf, it seems the MDS crashes when trying to write something:
root@mds01:~ # /etc/init.d/ceph restart
[ ok ] Restarting ceph (via systemctl): ceph.service.
root@mds01:~ # ceph daemonperf mds.0
---objecter---
writ read actv|
0 0 0
0 0 0
0 0 0
6 12 0
0 0 0
0 0 0
0 0 0
0 3 1
0 1 1
0 0 0
0 1 0
0 1 1
0 1 1
0 1 1
0 1 1
0 0 0
0 1 0
0 1 0
0 1 1
0 0 0
64 0 0
Traceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 638, in main
    DaemonWatcher(sockpath).run(interval, count)
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 265, in run
    dump = json.loads(admin_socket(self.asok_path, ["perf", "dump"]))
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 60, in admin_socket
    raise RuntimeError('exception getting command descriptions: ' + str(e))
RuntimeError: exception getting command descriptions: [Errno 111] Connection refused
And indeed, I am able to prevent the crash by running:
root@mds02:~ # ceph --admin-daemon /var/run/ceph/ceph-mds.1.asok force_readonly
during startup of the MDS.
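The window for typing that by hand is short, so a small poller that fires force_readonly as soon as the socket appears makes the race easier to win. A sketch, assuming the admin-socket wire format that ceph_daemon.py uses (null-terminated JSON command in, 4-byte big-endian length followed by the JSON payload out):

# Sketch: poll for the MDS admin socket and send force_readonly as soon
# as the restarting daemon creates it.
import json
import socket
import struct
import time

ASOK = '/var/run/ceph/ceph-mds.1.asok'

def asok_command(path, prefix):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    # command is a null-terminated JSON object with a 'prefix' key;
    # the reply is a 4-byte big-endian length, then the payload
    sock.sendall(json.dumps({'prefix': prefix}).encode('utf-8') + b'\0')
    length = struct.unpack('>I', sock.recv(4))[0]
    reply = b''
    while len(reply) < length:
        reply += sock.recv(length - len(reply))
    sock.close()
    return reply.decode('utf-8', 'replace')

while True:
    try:
        print(asok_command(ASOK, 'force_readonly'))
        break
    except socket.error:
        time.sleep(0.1)    # socket not there yet; keep polling

Started just before restarting the MDS, this issues the command within the first fraction of a second of the socket existing.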
Any advice on how to repair the filesystem?
I already tried this without success:
http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
The Ceph version used is Jewel 10.2.9.
Micha Krause