Hi CephFS experts.
1./ We are running Ceph and CephFS 9.2.0 with one active MDS and one
standby-replay MDS (standard configuration):
# ceph -s
    cluster <CLUSTERID>
     health HEALTH_OK
     monmap e1: 3 mons at {mon1=<MON1_IP>:6789/0,mon2=<MON2_IP>:6789/0,mon3=<MON3_IP>:6789/0}
            election epoch 98, quorum 0,1,2 mon1,mon3,mon2
     mdsmap e102: 1/1/1 up {0=mds2=up:active}, 1 up:standby-replay
     osdmap e689: 64 osds: 64 up, 64 in
            flags sortbitwise
      pgmap v2006627: 3072 pgs, 3 pools, 106 GB data, 85605 objects
            323 GB used, 174 TB / 174 TB avail
                3072 active+clean
  client io 1191 B/s rd, 2 op/s
2./ Today the standby-replay MDS crashed, while the active MDS kept running
normally. The log (appended after this email) shows a failed assert in
Thread::create(), i.e. a problem creating a thread.
3./ Our Ganglia monitoring shows:
- a sharp increase in system load
- a sharp peak in inbound network traffic
- no excessive memory usage and no excessive number of running processes
4./ For now we have simply restarted the standby-replay MDS, which seems to be
running fine again.
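
In case it is useful for anyone trying to narrow this down, below is a minimal
sketch (Python, Linux only) of how the thread count of the MDS process could be
compared with the kernel-wide limits the next time this happens. It assumes
that the failed assert in Thread::create() means the underlying thread-creation
call returned an error, for example EAGAIN when a thread limit is reached; this
has not been confirmed here. A process count may also not reflect the number of
threads inside a single process, which is why the sketch reads the per-process
"Threads:" field. The helper names (read_int, parse_threads, thread_report) are
made up for illustration; the /proc paths are the standard Linux ones.

```python
#!/usr/bin/env python3
"""Compare a process's thread count with kernel-wide limits (Linux only)."""
import os
import sys


def read_int(path):
    """Return the first integer stored in a /proc file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def parse_threads(status_text):
    """Extract the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def thread_report(pid):
    """Collect the thread count of `pid` and the related kernel limits."""
    try:
        with open("/proc/%d/status" % pid) as f:
            threads = parse_threads(f.read())
    except OSError:
        # Process is gone or /proc is not available on this system.
        threads = None
    return {
        "pid": pid,
        "threads": threads,
        "threads_max": read_int("/proc/sys/kernel/threads-max"),
        "pid_max": read_int("/proc/sys/kernel/pid_max"),
    }


def main(argv):
    # Use the PID given on the command line, else report on this script itself.
    if len(argv) > 1 and argv[1].isdigit():
        pid = int(argv[1])
    else:
        pid = os.getpid()
    for key, value in thread_report(pid).items():
        print("%-12s %s" % (key, value))


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a hypothetical name such as check_threads.py, it would be run as
"python3 check_threads.py <pid of ceph-mds>". Sampling it periodically on the
standby-replay host would show whether the thread count keeps growing between
standby_replay_restart cycles. Per-user limits (ulimit -u) are not covered by
this sketch.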
Have any of you hit this issue before?
TIA
Goncalo
# cat /var/log/ceph/ceph-mds.mds.log
(... snip ...)
2016-02-02 02:53:28.608130 7f047679d700 1 mds.0.0 standby_replay_restart (as
standby)
2016-02-02 02:53:28.614498 7f0474799700 1 mds.0.0 replay_done (as standby)
2016-02-02 02:53:29.614593 7f047679d700 1 mds.0.0 standby_replay_restart (as
standby)
2016-02-02 02:53:29.620953 7f0474799700 1 mds.0.0 replay_done (as standby)
2016-02-02 02:53:30.621036 7f047679d700 1 mds.0.0 standby_replay_restart (as
standby)
2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In function 'void
Thread::create(size_t)' thread 7f0474799700 time 2016-02-02 02:53:30.626626
common/Thread.cc: 154: FAILED assert(ret == 0)
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85)
[0x7f047ee2e105]
2: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
3: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
4: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
5: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96)
[0x7f047ea898b6]
6: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
7: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
8: (()+0x7dc5) [0x7f047dc7adc5]
9: (clone()+0x6d) [0x7f047cb6521d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- begin dump of recent events ---
(... snip ...)
-10> 2016-02-02 02:53:30.621036 7f047679d700 1 mds.0.0
standby_replay_restart (as standby)
-9> 2016-02-02 02:53:30.621091 7f047679d700 1 --
<BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER8_IP>:6808/23967 --
osd_op(mds.245029.0:7272223 200.00000000 [read 0~0] 5.844f3494
ack+read+known_if_redirected+full_force e689) v6 -- ?+0 0x7f0492e67600 con
0x7f04931c5b80
-8> 2016-02-02 02:53:30.624919 7f0466c66700 1 --
<BACKUP_MDS_IP>:6801/31961 <== osd.62 <OSD_SERVER8_IP>:6808/23967 1556070 ====
osd_op_reply(7272223 200.00000000 [read 0~90] v0'0 uv8348 ondisk = 0) v6 ====
179+0+90 (526420493 0 2009462618) 0x7f04b53fc000 con 0x7f04931c5b80
-7> 2016-02-02 02:53:30.625029 7f0474799700 1 mds.245029.journaler(ro)
probing for end of the log
-6> 2016-02-02 02:53:30.625094 7f0474799700 1 --
<BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER4_IP>:6814/11760 --
osd_op(mds.245029.0:7272224 200.00000537 [stat] 5.a003dca
ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0
0x7f04bd8c6680 con 0x7f04af654aa0
-5> 2016-02-02 02:53:30.625168 7f0474799700 1 --
<BACKUP_MDS_IP>:6801/31961 --> <OSD_SERVER1_IP>:6814/11663 --
osd_op(mds.245029.0:7272225 200.00000538 [stat] 5.aa28907c
ack+read+rwordered+known_if_redirected+full_force e689) v6 -- ?+0
0x7f04a72d1440 con 0x7f04b4b08d60
-4> 2016-02-02 02:53:30.626365 7f00e3e1c700 1 --
<BACKUP_MDS_IP>:6801/31961 <== osd.1 <OSD_SERVER1_IP>:6814/11663 92601 ====
osd_op_reply(7272225 200.00000538 [stat] v0'0 uv0 ack = -2 ((2) No such file or
directory)) v6 ==== 179+0+0 (3023689365 0 0) 0x7f04913899c0 con 0x7f04b4b08d60
-3> 2016-02-02 02:53:30.626433 7f0374d69700 1 --
<BACKUP_MDS_IP>:6801/31961 <== osd.24 <OSD_SERVER4_IP>:6814/11760 93044 ====
osd_op_reply(7272224 200.00000537 [stat] v0'0 uv1909 ondisk = 0) v6
==== 179+0+16 (736135707 0 822294467) 0x7f0482b3cc00 con 0x7f04af654aa0
-2> 2016-02-02 02:53:30.626500 7f0474799700 1 mds.245029.journaler(ro)
_finish_reprobe new_end = 5600997154 (header had 5600996718).
-1> 2016-02-02 02:53:30.626525 7f0474799700 2 mds.0.0 boot_start 2:
replaying mds log
0> 2016-02-02 02:53:30.640483 7f0474799700 -1 common/Thread.cc: In
function 'void Thread::create(size_t)' thread 7f0474799700 time 2016-02-02
02:53:30.626626
common/Thread.cc: 154: FAILED assert(ret == 0)
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mds.mds.log
--- end dump of recent events ---
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (()+0x4b6fa2) [0x7f047ed40fa2]
2: (()+0xf100) [0x7f047dc82100]
3: (gsignal()+0x37) [0x7f047caa45f7]
4: (abort()+0x148) [0x7f047caa5ce8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f047d3a89d5]
6: (()+0x5e946) [0x7f047d3a6946]
7: (()+0x5e973) [0x7f047d3a6973]
8: (()+0x5eb93) [0x7f047d3a6b93]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0x7f047ee2e2fa]
10: (Thread::create(unsigned long)+0x8a) [0x7f047ee19d1a]
11: (MDLog::replay(MDSInternalContextBase*)+0xe8) [0x7f047ecacd78]
12: (MDSRank::boot_start(MDSRank::BootStep, int)+0xe2e) [0x7f047ea88d3e]
13: (MDSRank::_standby_replay_restart_finish(int, unsigned long)+0x96)
[0x7f047ea898b6]
14: (MDSIOContextBase::complete(int)+0xa4) [0x7f047eca3cc4]
15: (Finisher::finisher_thread_entry()+0x168) [0x7f047ed63fb8]
16: (()+0x7dc5) [0x7f047dc7adc5]
17: (clone()+0x6d) [0x7f047cb6521d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- begin dump of recent events ---
-7> 2016-02-02 02:53:30.698174 7f033eaed700 2 --
<BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000
sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).reader couldn't read tag,
(0) Success
-6> 2016-02-02 02:53:30.698222 7f033eaed700 2 --
<BACKUP_MDS_IP>:6801/31961 >> <OSD_SERVER2_IP>:6810/8474 pipe(0x7f04a30c9000
sd=77 :38035 s=2 pgs=35114 cs=1 l=1 c=0x7f04a2dac100).fault (0) Success
-5> 2016-02-02 02:53:30.698275 7f04790a3700 1 mds.245029.objecter
ms_handle_reset on osd.12
-4> 2016-02-02 02:53:30.698286 7f04790a3700 1 --
<BACKUP_MDS_IP>:6801/31961 mark_down 0x7f04a2dac100 -- pipe dne
-3> 2016-02-02 02:53:30.700816 7f0475f9c700 10 monclient: _send_mon_message
to mon.mon3 at <MON3_IP>:6789/0
-2> 2016-02-02 02:53:30.700829 7f0475f9c700 1 --
<BACKUP_MDS_IP>:6801/31961 --> <MON3_IP>:6789/0 -- mdsbeacon(245029/mds
up:standby-replay seq 606971 v99) v4 -- ?+0 0x7f04beedaa00 con 0x7f0482aac2c0
-1> 2016-02-02 02:53:30.702635 7f04790a3700 1 --
<BACKUP_MDS_IP>:6801/31961 <== mon.1 <MON3_IP>:6789/0 625316 ====
mdsbeacon(245029/mds up:standby-replay seq 606971 v99) v4 ==== 121+0+0
(538935308 0 0) 0x7f04beeb5200 con 0x7f0482aac2c0
0> 2016-02-02 02:53:30.751289 7f0474799700 -1 *** Caught signal (Aborted)
**
in thread 7f0474799700
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mds.mds.log
--- end dump of recent events ---