Hi Ceph experts,
after updating from Ceph 0.94.9 to Ceph 10.2.5 on Ubuntu 14.04, 2 out of 3 OSD
processes are unable to start. On another machine the same thing happened, but
only on 1 out of 3 OSDs.
The update is done via ceph-deploy 1.5.37.
It shouldn’t be a permissions problem, because before updating I run
chown 64045:64045 on the OSD disks /dev/sd[bcd] and on the (separate) journal
partitions on the SSD, /dev/sda[678].
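To double-check that assumption, something like the following could list anything under an OSD data directory that is not owned by the expected user. This is only a minimal sketch: the function names are made up for illustration, and checking the mounted data directory (for example /var/lib/ceph/osd/ceph-271) rather than the raw devices is my assumption. Note that the log below reports the daemon switching to uid:gid 1001:1001 for ceph:ceph, so the ids to compare against are looked up by name here instead of hard-coding 64045.

```python
import os
import pwd


def expected_ids(user="ceph"):
    """Look up the numeric uid/gid of `user` on this host (Unix only)."""
    entry = pwd.getpwnam(user)
    return entry.pw_uid, entry.pw_gid


def find_foreign_owned(root, uid, gid):
    """Return paths under `root` whose owner or group differs from uid/gid.

    Hypothetical helper, not part of Ceph. Symlinks are not followed:
    the link itself is checked, not its target.
    """
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Each directory is visited once as `dirpath`, so checking it here
        # covers the root and every subdirectory.
        candidates = [dirpath] + [os.path.join(dirpath, n) for n in filenames]
        for path in candidates:
            st = os.lstat(path)
            if st.st_uid != uid or st.st_gid != gid:
                mismatched.append(path)
    return mismatched
```

Calling `find_foreign_owned("/var/lib/ceph/osd/ceph-271", *expected_ids())` on an affected host should return an empty list if ownership really is not the issue. The journal symlink is checked as a link only, so the ownership of the partition it points to would still need a separate look.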
When the upgrade procedure completes, the 3 ceph-osd processes are still
running, but if I restart them some of them refuse to start.
The log /var/log/ceph/ceph-osd.271.log is full of errors like this:
2017-02-13 09:47:17.590843 7fc57248f800 0 set uid:gid to 1001:1001 (ceph:ceph)
2017-02-13 09:47:17.590859 7fc57248f800 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
2017-02-13 09:47:17.591356 7fc57248f800 0 pidfile_write: ignore empty --pid-file
2017-02-13 09:47:17.601186 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
2017-02-13 09:47:17.601530 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-02-13 09:47:17.601539 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-02-13 09:47:17.601553 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice is supported
2017-02-13 09:47:17.613611 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-02-13 09:47:17.613673 7fc57248f800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is disabled by conf
2017-02-13 09:47:17.614454 7fc57248f800 1 leveldb: Recovering log #6754
2017-02-13 09:47:17.672544 7fc57248f800 1 leveldb: Delete type=3 #6753
2017-02-13 09:47:17.672662 7fc57248f800 1 leveldb: Delete type=0 #6754
2017-02-13 09:47:17.673640 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-02-13 09:47:17.684464 7fc57248f800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2017-02-13 09:47:17.688815 7fc57248f800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13 09:47:17.692735
osd/OSD.h: 885: FAILED assert(ret)
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ea51744dab]
2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
3: (OSD::init()+0x1ed2) [0x55ea51103872]
4: (main()+0x29d1) [0x55ea5106ae41]
5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
6: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-29> 2017-02-13 09:47:17.587145 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perfcounters_dump hook 0x55ea5d1d8050
-28> 2017-02-13 09:47:17.587164 7fc57248f800 5 asok(0x55ea5d1f8280) register_command 1 hook 0x55ea5d1d8050
-27> 2017-02-13 09:47:17.587166 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perf dump hook 0x55ea5d1d8050
-26> 2017-02-13 09:47:17.587168 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perfcounters_schema hook 0x55ea5d1d8050
-25> 2017-02-13 09:47:17.587170 7fc57248f800 5 asok(0x55ea5d1f8280) register_command 2 hook 0x55ea5d1d8050
-24> 2017-02-13 09:47:17.587172 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perf schema hook 0x55ea5d1d8050
-23> 2017-02-13 09:47:17.587174 7fc57248f800 5 asok(0x55ea5d1f8280) register_command perf reset hook 0x55ea5d1d8050
-22> 2017-02-13 09:47:17.587176 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config show hook 0x55ea5d1d8050
-21> 2017-02-13 09:47:17.587178 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config set hook 0x55ea5d1d8050
-20> 2017-02-13 09:47:17.587181 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config get hook 0x55ea5d1d8050
-19> 2017-02-13 09:47:17.587187 7fc57248f800 5 asok(0x55ea5d1f8280) register_command config diff hook 0x55ea5d1d8050
-18> 2017-02-13 09:47:17.587189 7fc57248f800 5 asok(0x55ea5d1f8280) register_command log flush hook 0x55ea5d1d8050
-17> 2017-02-13 09:47:17.587191 7fc57248f800 5 asok(0x55ea5d1f8280) register_command log dump hook 0x55ea5d1d8050
-16> 2017-02-13 09:47:17.587195 7fc57248f800 5 asok(0x55ea5d1f8280) register_command log reopen hook 0x55ea5d1d8050
-15> 2017-02-13 09:47:17.590843 7fc57248f800 0 set uid:gid to 1001:1001 (ceph:ceph)
-14> 2017-02-13 09:47:17.590859 7fc57248f800 0 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 187128
-13> 2017-02-13 09:47:17.591356 7fc57248f800 0 pidfile_write: ignore empty --pid-file
-12> 2017-02-13 09:47:17.601186 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) backend xfs (magic 0x58465342)
-11> 2017-02-13 09:47:17.601530 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
-10> 2017-02-13 09:47:17.601539 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
-9> 2017-02-13 09:47:17.601553 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: splice is supported
-8> 2017-02-13 09:47:17.613611 7fc57248f800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
-7> 2017-02-13 09:47:17.613673 7fc57248f800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-271) detect_feature: extsize is disabled by conf
-6> 2017-02-13 09:47:17.614454 7fc57248f800 1 leveldb: Recovering log #6754
-5> 2017-02-13 09:47:17.672544 7fc57248f800 1 leveldb: Delete type=3 #6753
-4> 2017-02-13 09:47:17.672662 7fc57248f800 1 leveldb: Delete type=0 #6754
-3> 2017-02-13 09:47:17.673640 7fc57248f800 0 filestore(/var/lib/ceph/osd/ceph-271) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
-2> 2017-02-13 09:47:17.684464 7fc57248f800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
-1> 2017-02-13 09:47:17.688815 7fc57248f800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
0> 2017-02-13 09:47:17.694483 7fc57248f800 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fc57248f800 time 2017-02-13 09:47:17.692735
osd/OSD.h: 885: FAILED assert(ret)
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x55ea51744dab]
2: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
3: (OSD::init()+0x1ed2) [0x55ea51103872]
4: (main()+0x29d1) [0x55ea5106ae41]
5: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
6: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.271.log
--- end dump of recent events ---
2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal (Aborted) **
in thread 7fc57248f800 thread_name:ceph-osd
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (()+0x8f2d32) [0x55ea51650d32]
2: (()+0x10330) [0x7fc571366330]
3: (gsignal()+0x37) [0x7fc56f3c5c37]
4: (abort()+0x148) [0x7fc56f3c9028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55ea51744f85]
6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
7: (OSD::init()+0x1ed2) [0x55ea51103872]
8: (main()+0x29d1) [0x55ea5106ae41]
9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
10: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
0> 2017-02-13 09:47:17.696962 7fc57248f800 -1 *** Caught signal (Aborted) **
in thread 7fc57248f800 thread_name:ceph-osd
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (()+0x8f2d32) [0x55ea51650d32]
2: (()+0x10330) [0x7fc571366330]
3: (gsignal()+0x37) [0x7fc56f3c5c37]
4: (abort()+0x148) [0x7fc56f3c9028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x55ea51744f85]
6: (OSDService::get_map(unsigned int)+0x3d) [0x55ea5114debd]
7: (OSD::init()+0x1ed2) [0x55ea51103872]
8: (main()+0x29d1) [0x55ea5106ae41]
9: (__libc_start_main()+0xf5) [0x7fc56f3b0f45]
10: (()+0x355b17) [0x55ea510b3b17]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 1 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 0 journaler
0/ 5 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
1/ 5 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
1/10 civetweb
1/ 5 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.271.log
--- end dump of recent events ---
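With about 300 OSDs it would help to know which ones hit this assert without reading each log by hand. Below is a minimal sketch that scans log text for "FAILED assert" lines like the one above. The function names and the regular expression are my own assumptions based on the format of this one log, and the glob pattern simply assumes the other OSD logs are named like /var/log/ceph/ceph-osd.271.log; none of this is a Ceph tool.

```python
import glob
import re
from collections import Counter

# Matches lines such as "osd/OSD.h: 885: FAILED assert(ret)".
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)$"
)


def failed_asserts(log_text):
    """Count each distinct (file, line, condition) assert line in log_text."""
    hits = Counter()
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            key = (match.group("file"), int(match.group("line")), match.group("cond"))
            hits[key] += 1
    return hits


def scan_logs(pattern="/var/log/ceph/ceph-osd.*.log"):
    """Map each log file matching `pattern` to its assert counts, skipping clean logs."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            hits = failed_asserts(handle.read())
        if hits:
            results[path] = dict(hits)
    return results
```

Each crash writes the assert line twice in the log above (once inline and once in the dump of recent events), so the counts are of matching lines, not of crashes.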
Removing the OSD disks, zapping them, and recreating them fixes the problem,
but I don’t think it’s a good idea to do this for 2/3 of our 300 OSDs.
Any ideas on:
1. How to avoid the problem during the update?
2. How to fix the failed OSDs while reusing the existing data?
Thank you!