Dear all,

My Ceph cluster ran fine for about two weeks, but the MDS crashed every 2-3 days. Now it is stuck in up:replay — it looks like replay crashes and the MDS process restarts again. What can I do about this?

# ceph -s
cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18
health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is 
degraded; noscrub,nodeep-scrub flag(s) set
monmap e1: 2 mons at 
{storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election 
epoch 26, quorum 0,1 storage-1-213,storage-1-214
mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby
osdmap e1932: 18 osds: 18 up, 18 in
flags noscrub,nodeep-scrub
pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects
4479 GB used, 32292 GB / 36772 GB avail
444 active+clean
56 active+clean+inconsistent
client io 125 MB/s rd, 31 op/s

MDS log:

2014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620 
from mon.0
2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby
2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621 
from mon.0
2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now 
mds.0.272
2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state change 
up:standby --> up:replay
2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start
2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set
2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is
2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929, 
have 1927
2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929 
(which blacklists prior instance)
2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for 
myself; not doing anything
2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening 
inotable
2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening 
sessionmap
2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening mds 
log
2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log bounds
2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening snap 
table
2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on 
10.1.0.213:6806/6114
2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on 
10.1.0.213:6811/6385
2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on 
10.1.0.213:6801/6110
2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200 to 
recover...
2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on 
10.1.0.214:6806/6238
2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered.
2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in 
format 1
2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1: 
loading/discovering base inodes
2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode 
with ino:100
2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode 
with ino:1
2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on 
10.1.0.214:6811/6276
2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying mds 
log
2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 
-> -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() 
hs=0+0,ss=0+0 0x5da0000]
2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1 / 
[2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 
0x5da0000]
2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 
-> -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() 
hs=0+0,ss=0+0 0x5da03b8]
2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir 100 
~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | 
subtree=1 0x5da03b8]
2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache 
adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=1076158 
cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) 
n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 
31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 
31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000] bound_dfs []
2014-09-18 12:38:55.675116 7f823ca08700 7 mds.0.cache 
adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=1076158 
cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) 
n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 
31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 
31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000] bounds
2014-09-18 12:38:55.675129 7f823ca08700 7 mds.0.cache current root is [dir 1 / 
[2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 
17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 
31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 
31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000]
2014-09-18 12:38:55.675142 7f823ca08700 7 mds.0.cache 
adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 100 ~mds0/ [2,head] auth 
v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 
rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 
0x5da03b8] bound_dfs []
2014-09-18 12:38:55.675152 7f823ca08700 7 mds.0.cache 
adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 100 ~mds0/ [2,head] auth 
v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 
rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 
0x5da03b8] bounds
2014-09-18 12:38:55.675160 7f823ca08700 7 mds.0.cache current root is [dir 100 
~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 
11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 
| subtree=1 0x5da03b8]
2014-09-18 12:38:55.675167 7f823ca08700 7 mds.0.cache recalc_auth_bits (replay)
2014-09-18 12:38:55.675171 7f823ca08700 7 mds.0.cache show_cache
2014-09-18 12:38:55.675172 7f823ca08700 7 mds.0.cache unlinked [inode 100 
[...2,head] ~mds0/ auth v85546 snaprealm=0x3bf1d40 f(v0 11=1+10) n(v16575 
rc2014-09-15 18:01:15.766451 b2512216 83=72+11)/n(v0 12=1+11) (iversion lock) | 
dirfrag=1 0x3d68000]
2014-09-18 12:38:55.675187 7f823ca08700 7 mds.0.cache dirfrag [dir 100 ~mds0/ 
[2,head] auth v=1288520 cv=0/0 dir_auth=0 state=1073741824 f(v0 11=1+10) 
n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | 
subtree=1 0x5da03b8]
2014-09-18 12:38:55.675207 7f823ca08700 7 mds.0.cache unlinked [inode 1 
[...2,head] / auth v1 snaprealm=0x3bf21c0 f() n(v0 1=0+1) (iversion lock) | 
dirfrag=1 0x3d68938]
2014-09-18 12:38:55.675213 7f823ca08700 7 mds.0.cache dirfrag [dir 1 / [2,head] 
auth v=1076158 cv=0/0 dir_auth=0 state=1073741824 f(v0 m2014-09-09 
17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 
31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 
31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000]
2014-09-18 12:38:56.089244 7f8240512700 5 mds.0.272 ms_handle_connect on 
10.1.0.214:6801/6078
2014-09-18 12:39:08.795245 7f8240512700 5 mds.0.272 handle_mds_map epoch 622 
from mon.0
2014-09-18 12:49:15.677419 7f823dc0c700 5 mds.0.272 is_laggy 29.870013 > 15 
since last acked beacon
2014-09-18 12:49:15.680801 7f823dc0c700 5 mds.0.272 tick bailing out since we 
seem laggy
2014-09-18 12:49:15.684033 7f8240512700 5 mds.0.272 handle_mds_map epoch 623 
from mon.0
2014-09-18 12:49:15.684406 7f8240512700 1 mds.-1.-1 handle_mds_map i 
(10.1.0.213:6822/17501) dne in the mdsmap, respawning myself
2014-09-18 12:49:15.684429 7f8240512700 1 mds.-1.-1 respawn
2014-09-18 12:49:15.684432 7f8240512700 1 mds.-1.-1 e: 'ceph-mds'
2014-09-18 12:49:15.684444 7f8240512700 1 mds.-1.-1 0: 'ceph-mds'
2014-09-18 12:49:15.684447 7f8240512700 1 mds.-1.-1 1: '-i'
2014-09-18 12:49:15.684449 7f8240512700 1 mds.-1.-1 2: 'storage-1-213'
2014-09-18 12:49:15.684451 7f8240512700 1 mds.-1.-1 3: '-f'
2014-09-18 12:49:16.475451 7f32533f77a0 0 ceph version 0.85 
(a0c22842db9eaee9840136784e94e50fabe77187), process ceph-mds, pid 17501
2014-09-18 12:49:16.487916 7f324e4f1700 5 mds.-1.0 ms_handle_connect on 
10.1.0.213:6789/0
2014-09-18 12:49:16.517705 7f324e4f1700 5 mds.-1.0 handle_mds_map epoch 623 
from mon.0
2014-09-18 12:49:16.900142 7f324e4f1700 5 mds.-1.-1 handle_mds_map epoch 624 
from mon.0
2014-09-18 12:49:16.900190 7f324e4f1700 1 mds.-1.0 handle_mds_map standby
2014-09-18 12:52:05.654849 7f324e4f1700 5 mds.-1.0 handle_mds_map epoch 625 
from mon.0
2014-09-18 12:52:05.654929 7f324e4f1700 1 mds.0.274 handle_mds_map i am now 
mds.0.274
2014-09-18 12:52:05.654933 7f324e4f1700 1 mds.0.274 handle_mds_map state change 
up:standby --> up:replay
2014-09-18 12:52:05.654937 7f324e4f1700 1 mds.0.274 replay_start
2014-09-18 12:52:05.654954 7f324e4f1700 7 mds.0.cache set_recovery_set
2014-09-18 12:52:05.654960 7f324e4f1700 1 mds.0.274 recovery set is
2014-09-18 12:52:05.654962 7f324e4f1700 1 mds.0.274 need osdmap epoch 1933, 
have 1931
2014-09-18 12:52:05.654965 7f324e4f1700 1 mds.0.274 waiting for osdmap 1933 
(which blacklists prior instance)
2014-09-18 12:52:05.655022 7f324e4f1700 5 mds.0.274 handle_mds_failure for 
myself; not doing anything
2014-09-18 12:52:05.668471 7f324e4f1700 2 mds.0.274 boot_start 0: opening 
inotable
2014-09-18 12:52:05.668731 7f324e4f1700 2 mds.0.274 boot_start 0: opening 
sessionmap
2014-09-18 12:52:05.668878 7f324e4f1700 2 mds.0.274 boot_start 0: opening mds 
log
2014-09-18 12:52:05.668883 7f324e4f1700 5 mds.0.log open discovering log bounds
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com