Hi Christian, I've just upgraded to 10.2.10 and both problems still persist: the OSD not starting (the most pressing one right now) and the wrong report of degraded objects:
20266198323226120/281736 objects degraded (7193329330730.229%)

Any ideas about how to resolve the problem with the OSD? I checked the xfs filesystem and it seems fine: no disk errors, and SMART also reports the disk as healthy. (A rough sketch of what I plan to try next with ceph-objectstore-tool is at the bottom of this mail.)

    -2> 2017-10-26 00:08:34.011845 7f370854a8c0  5 osd.3 pg_epoch: 8152 pg[9.6( v 8152'4311119 (8063'4308045,8152'4311119] local-les=8152 n=282 ec=417 les/c/f 8152/8152/0 8150/8150/8118) [2,3] r=1 lpr=0 pi=8115-8149/10 crt=8152'4311119 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Initial 0.012641 0 0.000000
    -1> 2017-10-26 00:08:34.011877 7f370854a8c0  5 osd.3 pg_epoch: 8152 pg[9.6( v 8152'4311119 (8063'4308045,8152'4311119] local-les=8152 n=282 ec=417 les/c/f 8152/8152/0 8150/8150/8118) [2,3] r=1 lpr=0 pi=8115-8149/10 crt=8152'4311119 lcod 0'0 inactive NOTIFY NIBBLEWISE] enter Reset
     0> 2017-10-26 00:08:34.013791 7f370854a8c0 -1 osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*, ceph::bufferlist*)' thread 7f370854a8c0 time 2017-10-26 00:08:34.012019
osd/PG.cc: 3066: FAILED assert(0 == "unable to open pg metadata")

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x562453806790]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x5624531c45e2]
 3: (OSD::load_pgs()+0x75a) [0x5624531188aa]
 4: (OSD::init()+0x2026) [0x562453123ca6]
 5: (main()+0x2ef1) [0x562453095301]
 6: (__libc_start_main()+0xf0) [0x7f37053aa830]
 7: (_start()+0x29) [0x5624530d6b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---

2017-10-26 00:08:34.024362 7f370854a8c0 -1 *** Caught signal (Aborted) **
 in thread 7f370854a8c0 thread_name:ceph-osd

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x98653e) [0x56245370653e]
 2: (()+0x11390) [0x7f3707423390]
 3: (gsignal()+0x38) [0x7f37053bf428]
 4: (abort()+0x16a) [0x7f37053c102a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x56245380697b]
 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x5624531c45e2]
 7: (OSD::load_pgs()+0x75a) [0x5624531188aa]
 8: (OSD::init()+0x2026) [0x562453123ca6]
 9: (main()+0x2ef1) [0x562453095301]
 10: (__libc_start_main()+0xf0) [0x7f37053aa830]
 11: (_start()+0x29) [0x5624530d6b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
     0> 2017-10-26 00:08:34.024362 7f370854a8c0 -1 *** Caught signal (Aborted) **
 in thread 7f370854a8c0 thread_name:ceph-osd

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x98653e) [0x56245370653e]
 2: (()+0x11390) [0x7f3707423390]
 3: (gsignal()+0x38) [0x7f37053bf428]
 4: (abort()+0x16a) [0x7f37053c102a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x56245380697b]
 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x5624531c45e2]
 7: (OSD::load_pgs()+0x75a) [0x5624531188aa]
 8: (OSD::init()+0x2026) [0x562453123ca6]
 9: (main()+0x2ef1) [0x562453095301]
 10: (__libc_start_main()+0xf0) [0x7f37053aa830]
 11: (_start()+0x29) [0x5624530d6b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---

On 25/10/17 23:43, Christian Wuerdig wrote:
> Well, there were a few bugs logged around upgrades which hit a similar
> assert, but those were supposedly fixed 2 years ago. Looks like Ubuntu
> 15.04 shipped Hammer (0.94.5), so presumably that's what you upgraded
> from.
> The current Jewel release is 10.2.10 - I don't know if the problem
> you're seeing is fixed in there, but I'd upgrade to 10.2.10 and then
> open a tracker ticket if the problem still persists.
>
> On Thu, Oct 26, 2017 at 9:13 AM, Gonzalo Aguilar Delgado
> <[email protected]> wrote:
>> Hello,
>>
>> I cannot tell what the previous version was, since I used the one installed
>> on Ubuntu 15.04. Now 16.04.
>>
>> But what I can tell is that I get errors from the ceph osd and mon from time
>> to time. The mon problems are scary, since I have to wipe the monitor and
>> reinstall a new one. I cannot really understand what's going on. I have
>> never had so many problems as after updating.
>>
>> Should I open a bug report?
>>
>> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55d5d510b250]
>> 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>> 3: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>> 4: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>> 5: (main()+0x2d6b) [0x55d5d49b193b]
>> 6: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>> 7: (_start()+0x29) [0x55d5d49f28c9]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 1 lockdep
>>    0/ 1 context
>>    1/ 1 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 1 buffer
>>    0/ 1 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 rbd_mirror
>>    0/ 5 rbd_replay
>>    0/ 5 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 5 osd
>>    0/ 5 optracker
>>    0/ 5 objclass
>>    1/ 3 filestore
>>    1/ 3 journal
>>    0/ 5 ms
>>    1/ 5 mon
>>    0/10 monc
>>    1/ 5 paxos
>>    0/ 5 tp
>>    1/ 5 auth
>>    1/ 5 crypto
>>    1/ 1 finisher
>>    1/ 5 heartbeatmap
>>    1/ 5 perfcounter
>>    1/ 5 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    1/ 5 asok
>>    1/ 1 throttle
>>    0/ 0 refs
>>    1/ 5 xio
>>    1/ 5 compressor
>>    1/ 5 newstore
>>    1/ 5 bluestore
>>    1/ 5 bluefs
>>    1/ 3 bdev
>>    1/ 5 kstore
>>    4/ 5 rocksdb
>>    4/ 5 leveldb
>>    1/ 5 kinetic
>>    1/ 5 fuse
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent 10000
>>   max_new 1000
>>   log_file /var/log/ceph/ceph-osd.3.log
>> --- end dump of recent events ---
>> 2017-10-25 22:09:58.778107 7f49d36958c0 -1 *** Caught signal (Aborted) **
>> in thread 7f49d36958c0 thread_name:ceph-osd
>>
>> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>> 1: (()+0x9616ee) [0x55d5d500b6ee]
>> 2: (()+0x11390) [0x7f49d235e390]
>> 3: (gsignal()+0x38) [0x7f49d02fa428]
>> 4: (abort()+0x16a) [0x7f49d02fc02a]
>> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55d5d510b43b]
>> 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>> 7: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>> 8: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>> 9: (main()+0x2d6b) [0x55d5d49b193b]
>> 10: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>> 11: (_start()+0x29) [0x55d5d49f28c9]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- begin dump of recent events ---
>> 0> 2017-10-25 22:09:58.778107 7f49d36958c0 -1 *** Caught signal (Aborted) **
>> in thread 7f49d36958c0 thread_name:ceph-osd
>>
>> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>> 1: (()+0x9616ee) [0x55d5d500b6ee]
>> 2: (()+0x11390) [0x7f49d235e390]
>> 3: (gsignal()+0x38) [0x7f49d02fa428]
>> 4: (abort()+0x16a) [0x7f49d02fc02a]
>> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x26b) [0x55d5d510b43b]
>> 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*, ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>> 7: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>> 8: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>> 9: (main()+0x2d6b) [0x55d5d49b193b]
>> 10: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>> 11: (_start()+0x29) [0x55d5d49f28c9]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 1 lockdep
>>    0/ 1 context
>>    1/ 1 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 1 buffer
>>    0/ 1 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 rbd_mirror
>>    0/ 5 rbd_replay
>>    0/ 5 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 5 osd
>>    0/ 5 optracker
>>    0/ 5 objclass
>>    1/ 3 filestore
>>    1/ 3 journal
>>    0/ 5 ms
>>    1/ 5 mon
>>    0/10 monc
>>    1/ 5 paxos
>>    0/ 5 tp
>>    1/ 5 auth
>>    1/ 5 crypto
>>    1/ 1 finisher
>>    1/ 5 heartbeatmap
>>    1/ 5 perfcounter
>>    1/ 5 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    1/ 5 asok
>>    1/ 1 throttle
>>    0/ 0 refs
>>    1/ 5 xio
>>    1/ 5 compressor
>>    1/ 5 newstore
>>    1/ 5 bluestore
>>    1/ 5 bluefs
>>    1/ 3 bdev
>>    1/ 5 kstore
>>    4/ 5 rocksdb
>>    4/ 5 leveldb
>>    1/ 5 kinetic
>>    1/ 5 fuse
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent 10000
>>   max_new 1000
>>   log_file /var/log/ceph/ceph-osd.3.log
>> -
>>
>>
>> On 25/10/17 00:42, Christian Wuerdig wrote:
>>
>> From which version of ceph to which other version of ceph did you
>> upgrade? Can you provide logs from crashing OSDs? The degraded object
>> percentage being larger than 100% has been reported before
>> (https://www.spinics.net/lists/ceph-users/msg39519.html) and it looks
>> like it was fixed a week or so ago:
>> http://tracker.ceph.com/issues/21803
>>
>> On Mon, Oct 23, 2017 at 5:10 AM, Gonzalo Aguilar Delgado
>> <[email protected]> wrote:
>>
>> Hello,
>>
>> Since we upgraded the ceph cluster we have been facing a lot of problems,
>> most of them due to OSDs crashing. What can cause this?
>>
>> This morning I woke up with this message:
>>
>> root@red-compute:~# ceph -w
>>     cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>>      health HEALTH_ERR
>>             1 pgs are stuck inactive for more than 300 seconds
>>             7 pgs inconsistent
>>             1 pgs stale
>>             1 pgs stuck stale
>>             recovery 20266198323167232/287940 objects degraded (7038340738753.641%)
>>             37154696925806626 scrub errors
>>             too many PGs per OSD (305 > max 300)
>>      monmap e12: 2 mons at {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
>>             election epoch 4986, quorum 0,1 red-compute,blue-compute
>>       fsmap e913: 1/1/1 up {0=blue-compute=up:active}
>>      osdmap e8096: 5 osds: 5 up, 5 in
>>             flags require_jewel_osds
>>       pgmap v68755349: 764 pgs, 6 pools, 558 GB data, 140 kobjects
>>             1119 GB used, 3060 GB / 4179 GB avail
>>             20266198323167232/287940 objects degraded (7038340738753.641%)
>>                  756 active+clean
>>                    7 active+clean+inconsistent
>>                    1 stale+active+clean
>>   client io 1630 B/s rd, 552 kB/s wr, 0 op/s rd, 64 op/s wr
>>
>> 2017-10-22 18:10:13.000812 mon.0 [INF] pgmap v68755348: 764 pgs: 7 active+clean+inconsistent, 756 active+clean, 1 stale+active+clean; 558 GB data, 1119 GB used, 3060 GB / 4179 GB avail; 1641 B/s rd, 229 kB/s wr, 39 op/s; 20266198323167232/287940 objects degraded (7038340738753.641%)
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
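PS: for reference, this is roughly what I'm planning to try next on the stopped osd.3 with ceph-objectstore-tool, since the assert happens while loading pg 9.6 metadata. The data path, journal path, pgid and export file name below are just my assumptions taken from the log above, so adjust before use; the export is only meant as a safety copy before doing anything destructive:

  # stop the daemon first (it is crashing on start anyway)
  systemctl stop ceph-osd@3

  # list the PGs the object store still knows about
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal --op list-pgs

  # take a backup copy of the suspect PG before touching anything
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --pgid 9.6 --op export --file /root/pg-9.6.export

  # only if the replica on the other OSD is known to be complete:
  # remove the broken copy so the OSD can start and backfill it from its peer
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --pgid 9.6 --op remove

If the removal turns out to be a mistake, my understanding is the exported file can be brought back later with --op import, but I'd only do that after checking the state of the replica.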
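For the 7 inconsistent PGs from the earlier status output, my understanding of the usual Jewel-era approach is to look at which objects are affected and then ask Ceph to repair the PG. The pgid here is only a placeholder:

  ceph health detail | grep inconsistent
  rados list-inconsistent-obj <pgid> --format=json-pretty
  ceph pg repair <pgid>

(list-inconsistent-obj only returns data if the PG has been scrubbed recently, as far as I know.)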
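And regarding Christian's question about versions: since I can't tell exactly what was installed before, I'll at least confirm what every daemon is running now, with something along these lines (the mon id is assumed to be the local short hostname here):

  ceph tell osd.* version
  ceph daemon mon.$(hostname -s) version
  dpkg -l | grep ceph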
_______________________________________________ ceph-users mailing list [email protected] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
