Hi,

I was deleting a lot of hard-linked files when "something" happened.

Now my mds starts, runs for a few seconds, writes a lot of lines like these:

   -43> 2017-09-06 13:51:43.396588 7f9047b21700 10 log_client  will send
2017-09-06 13:51:40.531563 mds.0 10.210.32.12:6802/2735447218 4963 : cluster [ERR]
loaded dup inode 100007d6511 [2,head] v17234443 at ~mds0/stray8/100007d6511,
but inode 100007d6511.head v17500983 already exists at ~mds0/stray7/100007d6511
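
For reference, the stray directories are dirfrag objects in the metadata pool
(600.00000000 to 609.00000000 for rank 0), so the duplicate should be visible
from the outside with something like this -- purely diagnostic, and the pool
name "metadata" is an assumption:

root@mds01:~ # rados -p metadata listomapkeys 607.00000000 | grep 100007d6511
root@mds01:~ # rados -p metadata listomapkeys 608.00000000 | grep 100007d6511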


And finally it crashes with this:


    -3> 2017-09-06 13:51:43.396762 7f9047b21700 10 monclient: _send_mon_message
to mon.2 at 10.210.34.11:6789/0
    -2> 2017-09-06 13:51:43.396770 7f9047b21700  1 -- 10.210.32.12:6802/2735447218
--> 10.210.34.11:6789/0 -- log(1000 entries from seq 4003 at 2017-09-06
13:51:38.718139) v1 -- ?+0 0x7f905c5d5d40 con 0x7f905902c600
    -1> 2017-09-06 13:51:43.399561 7f9047b21700  1 -- 10.210.32.12:6802/2735447218
<== mon.2 10.210.34.11:6789/0 26 ==== mdsbeacon(152160002/0 up:active seq 8
v47532) v7 ==== 126+0+0 (20071477 0 0) 0x7f90591b2080 con 0x7f905902c600
     0> 2017-09-06 13:51:43.401125 7f9043b19700 -1 *** Caught signal (Aborted) **
 in thread 7f9043b19700 thread_name:mds_rank_progr

 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0)
 1: (()+0x5087b7) [0x7f904ed547b7]
 2: (()+0xf890) [0x7f904e156890]
 3: (gsignal()+0x37) [0x7f904c5e1067]
 4: (abort()+0x148) [0x7f904c5e2448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f904ee5e386]
 6: (StrayManager::eval_remote_stray(CDentry*, CDentry*)+0x492) [0x7f904ebaad12]
 7: (StrayManager::__eval_stray(CDentry*, bool)+0x5f5) [0x7f904ebaefd5]
 8: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f904ebaf7ae]
 9: (MDCache::scan_stray_dir(dirfrag_t)+0x165) [0x7f904eb04145]
 10: (MDCache::populate_mydir()+0x7fc) [0x7f904eb73acc]
 11: (MDCache::open_root()+0xef) [0x7f904eb7447f]
 12: (MDSInternalContextBase::complete(int)+0x203) [0x7f904ecad5c3]
 13: (MDSRank::_advance_queues()+0x382) [0x7f904ea689e2]
 14: (MDSRank::ProgressThread::entry()+0x4a) [0x7f904ea68e6a]
 15: (()+0x8064) [0x7f904e14f064]
 16: (clone()+0x6d) [0x7f904c69462d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  99/99 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.0.log
--- end dump of recent events ---
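
In case a more verbose log helps, the mds debug level can be raised in ceph.conf
on the mds host before the next start attempt (the values are just a guess at
something useful):

[mds]
    debug mds = 20
    debug ms = 1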

Looking at daemonperf, it seems the mds crashes as soon as it tries to write something:

root@mds01:~ # /etc/init.d/ceph restart
[ ok ] Restarting ceph (via systemctl): ceph.service.

root@mds01:~ # ceph daemonperf mds.0
---objecter---
writ read actv|
  0    0    0
  0    0    0
  0    0    0
  6   12    0
  0    0    0
  0    0    0
  0    0    0
  0    3    1
  0    1    1
  0    0    0
  0    1    0
  0    1    1
  0    1    1
  0    1    1
  0    1    1
  0    0    0
  0    1    0
  0    1    0
  0    1    1
  0    0    0
 64    0    0
Traceback (most recent call last):
  File "/usr/bin/ceph", line 948, in <module>
    retval = main()
  File "/usr/bin/ceph", line 638, in main
    DaemonWatcher(sockpath).run(interval, count)
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 265, in run
    dump = json.loads(admin_socket(self.asok_path, ["perf", "dump"]))
  File "/usr/lib/python2.7/dist-packages/ceph_daemon.py", line 60, in 
admin_socket
    raise RuntimeError('exception getting command descriptions: ' + str(e))
RuntimeError: exception getting command descriptions: [Errno 111] Connection 
refused
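
daemonperf dies as soon as the admin socket disappears, but the same objecter
counters can also be read straight from the socket while the mds is still up,
roughly like this:

root@mds01:~ # ceph daemon mds.0 perf dump | python -m json.tool | grep -A 10 '"objecter"'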


And indeed, I am able to prevent the crash by running:

root@mds02:~ # ceph --admin-daemon /var/run/ceph/ceph-mds.1.asok force_readonly

during startup of the mds.
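
Timing-wise that means: restart the mds, wait for the admin socket to appear,
and fire the command immediately, e.g.:

root@mds02:~ # until [ -S /var/run/ceph/ceph-mds.1.asok ]; do sleep 0.2; done && \
               ceph --admin-daemon /var/run/ceph/ceph-mds.1.asok force_readonly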

Any advice on how to repair the filesystem?

I already tried this without success:

http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
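
For reference, the sequence on that page is roughly: export the journal as a
backup, recover dentries from it, then reset the journal and the session table:

root@mds01:~ # cephfs-journal-tool journal export backup.bin
root@mds01:~ # cephfs-journal-tool event recover_dentries summary
root@mds01:~ # cephfs-journal-tool journal reset
root@mds01:~ # cephfs-table-tool all reset session

plus the cephfs-data-scan steps for the case of lost metadata objects.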

The Ceph version used is Jewel 10.2.9.


Micha Krause