Hello,
Our cluster failed again this morning, and it took almost the whole day to
stabilize. Here are some of the problems we encountered in the OSDs' logs:
*Some OSDs refused to start:*
-1> 2016-11-23 15:50:49.507588 7f5f5b7a5800 -1 osd.27 196774 load_pgs: have
pgid 9.268 at epoch 196874, but missing map. Crashing.
0> 2016-11-23 15:50:49.509473 7f5f5b7a5800 -1 osd/OSD.cc: In function 'void
OSD::load_pgs()' thread 7f5f5b7a5800 time 2016-11-23 15:50:49.507597
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7f5f5c1d35b5]
2: (OSD::load_pgs()+0x1f07) [0x7f5f5bb53b57]
3: (OSD::init()+0x2086) [0x7f5f5bb64e56]
4: (main()+0x2c55) [0x7f5f5bac8be5]
5: (__libc_start_main()+0xf5) [0x7f5f586b7b15]
6: (()+0x353009) [0x7f5f5bb13009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
We finally managed to start them by removing the map-less PG from the OSD,
once that PG was reported "active+clean" on the cluster (i.e. healthy on its
other OSDs). We used ceph-objectstore-tool for this, roughly as shown below.
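In case it helps someone, the procedure was along these lines (the paths,
OSD id and pgid are from the osd.27 crash above; adjust them to your setup,
and the OSD must be down first, which it was anyway since it crashed at
start). Exporting the PG before removing it leaves you a backup in case the
removal turns out to be a mistake:

  # confirm the PG is healthy on its other OSDs before touching it
  ceph pg 9.268 query | grep '"state"'
  # keep a backup copy of the PG on this OSD, then remove it
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27 \
      --journal-path /var/lib/ceph/osd/ceph-27/journal \
      --pgid 9.268 --op export --file /root/pg-9.268.export
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27 \
      --journal-path /var/lib/ceph/osd/ceph-27/journal \
      --pgid 9.268 --op remove
  # the OSD should now get past load_pgs
  systemctl start ceph-osd@27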
*Some OSDs that start but hit the suicide timeout after about 3 minutes:*
-5> 2016-11-23 15:32:28.488489 7fbe411ff700 5 osd.24 197525 heartbeat:
osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])
-4> 2016-11-23 15:32:30.188632 7fbe411ff700 5 osd.24 197525 heartbeat:
osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])
-3> 2016-11-23 15:32:32.678977 7fbe67ce3700 1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5ad4b700' had timed out after 60
-2> 2016-11-23 15:32:32.679010 7fbe67ce3700 1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5b54c700' had timed out after 60
-1> 2016-11-23 15:32:32.679016 7fbe67ce3700 1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5b54c700' had suicide timed out after 180
0> 2016-11-23 15:32:32.680982 7fbe67ce3700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*,
const char*, time_t)' thread 7fbe67ce3700 time 2016-11-23 15:32:32.679038
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
We have no explanation for this; the only clue is that the 60 and 180 in
these messages match the defaults of filestore_op_thread_timeout and
filestore_op_thread_suicide_timeout, so the FileStore op threads were
blocked for at least that long.
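If it helps anyone in the same situation, these timeouts can be raised in
ceph.conf as a temporary workaround while the underlying disk or journal
slowness is investigated (this only hides the symptom, and the values below
are only an example, not a recommendation):

  [osd]
  # default 60: thread pool health-check timeout
  filestore op thread timeout = 180
  # default 180: OSD aborts if an op thread is stuck this long
  filestore op thread suicide timeout = 600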
*Some deadlocks:*
OSD.32 refused to start because PG 9.72 has no map:
-1> 2016-11-23 15:02:32.675283 7f2b74492800 -1
osd.32 196921 load_pgs: have pgid 9.72 at epoch 196975, but missing map.
Crashing.
0> 2016-11-23 15:02:32.676710 7f2b74492800 -1 osd/OSD.cc: In function 'void
OSD::load_pgs()' thread 7f2b74492800 time 2016-11-23 15:02:32.675293
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")
Meanwhile, PG 9.72 is in state "down+peering" and is waiting for OSD.32
either to start or to be marked "lost". We had to declare the OSD lost to
break these deadlocks.
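For the record, marking an OSD lost is done like this (careful: it tells the
cluster to give up on whatever data existed only on that OSD, so it can cause
data loss; we only did it because the OSD could not be started at all):

  ceph osd lost 32 --yes-i-really-mean-it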
*Some log messages we'd like to have explained:*
2016-11-23 15:02:32.202200 7f2b74492800 0 set uid:gid to 167:167
(ceph:ceph)
2016-11-23 15:02:32.202240 7f2b74492800 0 ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1718781
2016-11-23 15:02:32.203557 7f2b74492800 0 pidfile_write: ignore empty
--pid-file
2016-11-23 15:02:32.231376 7f2b74492800 0
filestore(/var/lib/ceph/osd/ceph-32) backend xfs (magic 0x58465342)
2016-11-23 15:02:32.231935 7f2b74492800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2016-11-23 15:02:32.231941 7f2b74492800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features:
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2016-11-23 15:02:32.231961 7f2b74492800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: splice
is supported
2016-11-23 15:02:32.232777 7f2b74492800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2016-11-23 15:02:32.232824 7f2b74492800 0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_feature: extsize is
disabled by conf
2016-11-23 15:02:32.233704 7f2b74492800 1 leveldb: Recovering log #102027
2016-11-23 15:02:32.234863 7f2b74492800 1 leveldb: Delete type=3 #102026
2016-11-23 15:02:32.234926 7f2b74492800 1 leveldb: Delete type=0 #102027
2016-11-23 15:02:32.235444 7f2b74492800 0
filestore(/var/lib/ceph/osd/ceph-32) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2016-11-23 15:02:32.237484 7f2b74492800 1 journal _open
/var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 1
2016-11-23 15:02:32.238027 7f2b74492800 1 journal _open
/var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 1
2016-11-23 15:02:32.238992 7f2b74492800 1
filestore(/var/lib/ceph/osd/ceph-32) upgrade
2016-11-23 15:02:32.239727 7f2b74492800 0 <cls>
cls/hello/cls_hello.cc:305: loading cls_hello
2016-11-23 15:02:32.240153 7f2b74492800 0 <cls>
cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2016-11-23 15:02:32.245427 7f2b74492800 0 osd.32 196921 crush map has
features 1107558400, adjusting msgr requires for clients
2016-11-23 15:02:32.245435 7f2b74492800 0 osd.32 196921 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons
2016-11-23 15:02:32.245439 7f2b74492800 0 osd.32 196921 crush map has
features 1107558400, adjusting msgr requires for osds
2016-11-23 15:02:32.639715 7f2b74492800 0 osd.32 196921 load_pgs
If you have any answers... I'll take them.
Vincent