Hello,

Our cluster failed again this morning, and it took most of the day to
stabilize. Here are some of the problems we encountered in the OSDs' logs:

*Some OSDs refused to start:*

-1> 2016-11-23 15:50:49.507588 7f5f5b7a5800 -1 osd.27 196774 load_pgs: have
pgid 9.268 at epoch 196874, but missing map.  Crashing.

0> 2016-11-23 15:50:49.509473 7f5f5b7a5800 -1 osd/OSD.cc: In function 'void
OSD::load_pgs()' thread 7f5f5b7a5800 time 2016-11-23 15:50:49.507597
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")



ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7f5f5c1d35b5]

2: (OSD::load_pgs()+0x1f07) [0x7f5f5bb53b57]

3: (OSD::init()+0x2086) [0x7f5f5bb64e56]

4: (main()+0x2c55) [0x7f5f5bac8be5]

5: (__libc_start_main()+0xf5) [0x7f5f586b7b15]

6: (()+0x353009) [0x7f5f5bb13009]

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
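One workaround for this "Missing map in load_pgs" assert, if your ceph-objectstore-tool build supports `--op set-osdmap` (we are not sure it does in 10.2.2), would be to fetch the missing osdmap epoch from the monitors and inject it into the stopped OSD's store. A sketch, with the osd id and epoch taken from the log above as placeholders; the commands are printed rather than executed so nothing runs by accident:

```shell
# Placeholders from the log above: osd.27, missing epoch 196874.
OSD=27
EPOCH=196874
DATA=/var/lib/ceph/osd/ceph-${OSD}

# Fetch the full osdmap for the missing epoch from the monitors ...
echo "ceph osd getmap ${EPOCH} -o /tmp/osdmap.${EPOCH}"
# ... and, with the OSD stopped, inject it into its object store.
echo "ceph-objectstore-tool --data-path ${DATA} --op set-osdmap --file /tmp/osdmap.${EPOCH}"
```

Run the printed commands on the OSD host, with the OSD daemon stopped.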



--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds



We finally managed to start them by using ceph-objectstore-tool to remove
the PG whose map was missing from the OSD, once that PG was "active+clean"
elsewhere in the cluster.
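For reference, the removal looks roughly like this (osd id and pgid are placeholders from the first log; the commands are printed rather than executed here, since they are destructive, and we export the PG first so it can be re-imported if removal turns out to be a mistake):

```shell
# Placeholders from the log above: osd.27, pg 9.268.
OSD=27
PGID=9.268
DATA=/var/lib/ceph/osd/ceph-${OSD}
JOURNAL=${DATA}/journal

echo "systemctl stop ceph-osd@${OSD}"
# Keep a copy of the PG before deleting it, in case it is needed later.
echo "ceph-objectstore-tool --data-path ${DATA} --journal-path ${JOURNAL} --pgid ${PGID} --op export --file /root/pg-${PGID}.export"
# Remove the PG whose osdmap is missing, then restart the OSD.
echo "ceph-objectstore-tool --data-path ${DATA} --journal-path ${JOURNAL} --pgid ${PGID} --op remove"
echo "systemctl start ceph-osd@${OSD}"
```

Run the printed commands on the OSD host, in order.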


*Some OSDs that start but commit suicide after 3 minutes:*



-5> 2016-11-23 15:32:28.488489 7fbe411ff700  5 osd.24 197525 heartbeat:
osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])

-4> 2016-11-23 15:32:30.188632 7fbe411ff700  5 osd.24 197525 heartbeat:
osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])

-3> 2016-11-23 15:32:32.678977 7fbe67ce3700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5ad4b700' had timed out after 60

-2> 2016-11-23 15:32:32.679010 7fbe67ce3700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5b54c700' had timed out after 60

-1> 2016-11-23 15:32:32.679016 7fbe67ce3700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fbe5b54c700' had suicide timed out after 180

 0> 2016-11-23 15:32:32.680982 7fbe67ce3700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*,
const char*, time_t)' thread 7fbe67ce3700 time 2016-11-23 15:32:32.679038
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")



We have no explanation for this.
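If the disks were simply too slow to keep up during recovery, one hedge would be to raise the FileStore thread timeouts in ceph.conf; the option names below are the standard ones (their Jewel defaults, 60 s warn and 180 s suicide, match the thresholds in the log above), but the values are guesses, not a recommendation:

```ini
[osd]
# Defaults: 60 s warn, 180 s suicide, as seen in the heartbeat_map log.
filestore op thread timeout = 180
filestore op thread suicide timeout = 600
```

Raising these only papers over slow storage; it does not explain why the op threads stalled in the first place.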


*Some deadlocks:*



OSD.32 refused to start because PG 9.72 had no map:

-1> 2016-11-23 15:02:32.675283 7f2b74492800 -1 osd.32 196921 load_pgs: have
pgid 9.72 at epoch 196975, but missing map.  Crashing.

0> 2016-11-23 15:02:32.676710 7f2b74492800 -1 osd/OSD.cc: In function 'void
OSD::load_pgs()' thread 7f2b74492800 time 2016-11-23 15:02:32.675293
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")



Meanwhile, PG 9.72 was in state "down+peering", waiting for OSD.32 to
start or to be marked "lost".



We had to declare the OSD lost to break these deadlocks.
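For completeness, breaking the deadlock looks like this (osd id from the log above; the command is printed rather than executed here, because marking an OSD lost can discard data and should be a last resort):

```shell
# osd.32 from the deadlock above.
OSD=32
# Tell the cluster this OSD's data is gone so PG 9.72 can peer without it.
# WARNING: potentially loses data; only when the OSD truly cannot start.
echo "ceph osd lost ${OSD} --yes-i-really-mean-it"
```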



*Some log messages we'd like an explanation for:*



2016-11-23 15:02:32.202200 7f2b74492800  0 set uid:gid to 167:167
(ceph:ceph)

2016-11-23 15:02:32.202240 7f2b74492800  0 ceph version 10.2.2
(45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1718781

2016-11-23 15:02:32.203557 7f2b74492800  0 pidfile_write: ignore empty
--pid-file

2016-11-23 15:02:32.231376 7f2b74492800  0
filestore(/var/lib/ceph/osd/ceph-32) backend xfs (magic 0x58465342)

2016-11-23 15:02:32.231935 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option

2016-11-23 15:02:32.231941 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features:
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option

2016-11-23 15:02:32.231961 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: splice
is supported

2016-11-23 15:02:32.232777 7f2b74492800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)

2016-11-23 15:02:32.232824 7f2b74492800  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_feature: extsize is
disabled by conf

2016-11-23 15:02:32.233704 7f2b74492800  1 leveldb: Recovering log #102027

2016-11-23 15:02:32.234863 7f2b74492800  1 leveldb: Delete type=3 #102026



2016-11-23 15:02:32.234926 7f2b74492800  1 leveldb: Delete type=0 #102027



2016-11-23 15:02:32.235444 7f2b74492800  0
filestore(/var/lib/ceph/osd/ceph-32) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled

2016-11-23 15:02:32.237484 7f2b74492800  1 journal _open
/var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 1

2016-11-23 15:02:32.238027 7f2b74492800  1 journal _open
/var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 1

2016-11-23 15:02:32.238992 7f2b74492800  1
filestore(/var/lib/ceph/osd/ceph-32) upgrade

2016-11-23 15:02:32.239727 7f2b74492800  0 <cls>
cls/hello/cls_hello.cc:305: loading cls_hello

2016-11-23 15:02:32.240153 7f2b74492800  0 <cls>
cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan

2016-11-23 15:02:32.245427 7f2b74492800  0 osd.32 196921 crush map has
features 1107558400, adjusting msgr requires for clients

2016-11-23 15:02:32.245435 7f2b74492800  0 osd.32 196921 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons

2016-11-23 15:02:32.245439 7f2b74492800  0 osd.32 196921 crush map has
features 1107558400, adjusting msgr requires for osds

2016-11-23 15:02:32.639715 7f2b74492800  0 osd.32 196921 load_pgs


If you have any answers ... I'll take them.


Vincent
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com