Hi Christian,

I've just upgraded to 10.2.10 and both problems still persist: the OSD
not starting (the most pressing one now) and the wrong report of degraded
objects:

           20266198323226120/281736 objects degraded (7193329330730.229%)
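
For what it's worth, the percentage is just degraded/total * 100; a quick
sanity check with the numbers copied from that line:

   $ python3 -c 'print(20266198323226120 / 281736 * 100)'

gives ~7.19e12, i.e. exactly the nonsense percentage reported. So the
corrupt value is the degraded-object counter itself (presumably garbage
or an underflowed stat), not the percentage computation.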
 

Any ideas about how to resolve the problem with the OSD?

I checked the XFS filesystem and it seems OK. No disk errors, and SMART
also reports the drive as healthy.
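
For reference, these are roughly the checks I ran (a sketch — /dev/sdb
is just an example device name here, and xfs_repair -n needs the
filesystem unmounted first):

   $ smartctl -a /dev/sdb            # SMART health and error counters
   $ umount /var/lib/ceph/osd/ceph-3
   $ xfs_repair -n /dev/sdb1         # read-only check, changes nothing
   $ dmesg | grep -iE 'sdb|xfs'      # any kernel-level I/O errors

None of them reported anything suspicious.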



    -2> 2017-10-26 00:08:34.011845 7f370854a8c0  5 osd.3 pg_epoch: 8152
pg[9.6( v 8152'4311119 (8063'4308045,8152'4311119] local-les=8152 n=282
ec=417 les/c/f 8152/8152/0 8150/8150/8118) [2,3] r=1 lpr=0
pi=8115-8149/10 crt=8152'4311119 lcod 0'0 inactive NOTIFY NIBBLEWISE]
exit Initial 0.012641 0 0.000000
    -1> 2017-10-26 00:08:34.011877 7f370854a8c0  5 osd.3 pg_epoch: 8152
pg[9.6( v 8152'4311119 (8063'4308045,8152'4311119] local-les=8152 n=282
ec=417 les/c/f 8152/8152/0 8150/8150/8118) [2,3] r=1 lpr=0
pi=8115-8149/10 crt=8152'4311119 lcod 0'0 inactive NOTIFY NIBBLEWISE]
enter Reset
     0> 2017-10-26 00:08:34.013791 7f370854a8c0 -1 osd/PG.cc: In
function 'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*,
ceph::bufferlist*)' thread 7f370854a8c0 time 2017-10-26 00:08:34.012019
osd/PG.cc: 3066: FAILED assert(0 == "unable to open pg metadata")

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x562453806790]
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x642) [0x5624531c45e2]
 3: (OSD::load_pgs()+0x75a) [0x5624531188aa]
 4: (OSD::init()+0x2026) [0x562453123ca6]
 5: (main()+0x2ef1) [0x562453095301]
 6: (__libc_start_main()+0xf0) [0x7f37053aa830]
 7: (_start()+0x29) [0x5624530d6b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---
2017-10-26 00:08:34.024362 7f370854a8c0 -1 *** Caught signal (Aborted) **
 in thread 7f370854a8c0 thread_name:ceph-osd

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x98653e) [0x56245370653e]
 2: (()+0x11390) [0x7f3707423390]
 3: (gsignal()+0x38) [0x7f37053bf428]
 4: (abort()+0x16a) [0x7f37053c102a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x26b) [0x56245380697b]
 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x642) [0x5624531c45e2]
 7: (OSD::load_pgs()+0x75a) [0x5624531188aa]
 8: (OSD::init()+0x2026) [0x562453123ca6]
 9: (main()+0x2ef1) [0x562453095301]
 10: (__libc_start_main()+0xf0) [0x7f37053aa830]
 11: (_start()+0x29) [0x5624530d6b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- begin dump of recent events ---
     0> 2017-10-26 00:08:34.024362 7f370854a8c0 -1 *** Caught signal
(Aborted) **
 in thread 7f370854a8c0 thread_name:ceph-osd

 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (()+0x98653e) [0x56245370653e]
 2: (()+0x11390) [0x7f3707423390]
 3: (gsignal()+0x38) [0x7f37053bf428]
 4: (abort()+0x16a) [0x7f37053c102a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x26b) [0x56245380697b]
 6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x642) [0x5624531c45e2]
 7: (OSD::load_pgs()+0x75a) [0x5624531188aa]
 8: (OSD::init()+0x2026) [0x562453123ca6]
 9: (main()+0x2ef1) [0x562453095301]
 10: (__libc_start_main()+0xf0) [0x7f37053aa830]
 11: (_start()+0x29) [0x5624530d6b09]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---
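
The assert fires in PG::peek_map_epoch() while OSD::load_pgs() is walking
the PGs on disk, so it looks like one PG's on-disk metadata can't be read.
If it turns out to be a single damaged PG, my plan would be roughly the
following (a sketch only — 9.6 is just the last PG the log shows before
the crash, not necessarily the broken one, and the paths are my OSD's
defaults):

   # stop the OSD before touching its store
   systemctl stop ceph-osd@3

   # list the PGs this OSD holds; probing each with --op info should
   # reveal the one with unreadable metadata
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
       --journal-path /var/lib/ceph/osd/ceph-3/journal --op list-pgs

   # keep a copy of the suspect PG before doing anything destructive
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
       --journal-path /var/lib/ceph/osd/ceph-3/journal \
       --pgid 9.6 --op export --file /root/pg9.6.export

   # remove it from this OSD so load_pgs() can get past it; the other
   # replica (osd.2, per the [2,3] acting set above) should backfill it
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
       --journal-path /var/lib/ceph/osd/ceph-3/journal \
       --pgid 9.6 --op remove

Does that sound sane, or is there a safer way to get osd.3 past
load_pgs()?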


On 25/10/17 23:43, Christian Wuerdig wrote:
> Well, there were a few bugs logged around upgrades which hit a similar
> assert, but those were supposedly fixed 2 years ago. Looks like Ubuntu
> 15.04 shipped Hammer (0.94.5), so presumably that's what you upgraded
> from.
> The current Jewel release is 10.2.10 - I don't know if the problem
> you're seeing is fixed in there but I'd upgrade to 10.2.10 and then
> open a tracker ticket if the problem still persists.
>
> On Thu, Oct 26, 2017 at 9:13 AM, Gonzalo Aguilar Delgado
> <[email protected]> wrote:
>> Hello,
>>
>> I cannot tell what the previous version was, since I used the one that
>> shipped with Ubuntu 15.04 (now 16.04).
>>
>> But what I can tell is that I get errors from the ceph OSDs and mons from
>> time to time. The mon problems are scary, since I have to wipe the monitor
>> and reinstall a new one. I cannot really understand what's going on. I
>> have never had so many problems as since updating.
>>
>> Should I open a bug report?
>>
>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x80) [0x55d5d510b250]
>>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
>> ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>>  3: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>>  4: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>>  5: (main()+0x2d6b) [0x55d5d49b193b]
>>  6: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>>  7: (_start()+0x29) [0x55d5d49f28c9]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 1 lockdep
>>    0/ 1 context
>>    1/ 1 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 1 buffer
>>    0/ 1 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 rbd_mirror
>>    0/ 5 rbd_replay
>>    0/ 5 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 5 osd
>>    0/ 5 optracker
>>    0/ 5 objclass
>>    1/ 3 filestore
>>    1/ 3 journal
>>    0/ 5 ms
>>    1/ 5 mon
>>    0/10 monc
>>    1/ 5 paxos
>>    0/ 5 tp
>>    1/ 5 auth
>>    1/ 5 crypto
>>    1/ 1 finisher
>>    1/ 5 heartbeatmap
>>    1/ 5 perfcounter
>>    1/ 5 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    1/ 5 asok
>>    1/ 1 throttle
>>    0/ 0 refs
>>    1/ 5 xio
>>    1/ 5 compressor
>>    1/ 5 newstore
>>    1/ 5 bluestore
>>    1/ 5 bluefs
>>    1/ 3 bdev
>>    1/ 5 kstore
>>    4/ 5 rocksdb
>>    4/ 5 leveldb
>>    1/ 5 kinetic
>>    1/ 5 fuse
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent     10000
>>   max_new         1000
>>   log_file /var/log/ceph/ceph-osd.3.log
>> --- end dump of recent events ---
>> 2017-10-25 22:09:58.778107 7f49d36958c0 -1 *** Caught signal (Aborted) **
>>  in thread 7f49d36958c0 thread_name:ceph-osd
>>
>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>  1: (()+0x9616ee) [0x55d5d500b6ee]
>>  2: (()+0x11390) [0x7f49d235e390]
>>  3: (gsignal()+0x38) [0x7f49d02fa428]
>>  4: (abort()+0x16a) [0x7f49d02fc02a]
>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x26b) [0x55d5d510b43b]
>>  6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
>> ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>>  7: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>>  8: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>>  9: (main()+0x2d6b) [0x55d5d49b193b]
>>  10: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>>  11: (_start()+0x29) [0x55d5d49f28c9]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> --- begin dump of recent events ---
>>      0> 2017-10-25 22:09:58.778107 7f49d36958c0 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f49d36958c0 thread_name:ceph-osd
>>
>>  ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>>  1: (()+0x9616ee) [0x55d5d500b6ee]
>>  2: (()+0x11390) [0x7f49d235e390]
>>  3: (gsignal()+0x38) [0x7f49d02fa428]
>>  4: (abort()+0x16a) [0x7f49d02fc02a]
>>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x26b) [0x55d5d510b43b]
>>  6: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
>> ceph::buffer::list*)+0x642) [0x55d5d4ade2b2]
>>  7: (OSD::load_pgs()+0x75a) [0x55d5d4a3383a]
>>  8: (OSD::init()+0x2026) [0x55d5d4a3ec46]
>>  9: (main()+0x2d6b) [0x55d5d49b193b]
>>  10: (__libc_start_main()+0xf0) [0x7f49d02e5830]
>>  11: (_start()+0x29) [0x55d5d49f28c9]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 1 lockdep
>>    0/ 1 context
>>    1/ 1 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 1 buffer
>>    0/ 1 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 rbd_mirror
>>    0/ 5 rbd_replay
>>    0/ 5 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 5 osd
>>    0/ 5 optracker
>>    0/ 5 objclass
>>    1/ 3 filestore
>>    1/ 3 journal
>>    0/ 5 ms
>>    1/ 5 mon
>>    0/10 monc
>>    1/ 5 paxos
>>    0/ 5 tp
>>    1/ 5 auth
>>    1/ 5 crypto
>>    1/ 1 finisher
>>    1/ 5 heartbeatmap
>>    1/ 5 perfcounter
>>    1/ 5 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    1/ 5 asok
>>    1/ 1 throttle
>>    0/ 0 refs
>>    1/ 5 xio
>>    1/ 5 compressor
>>    1/ 5 newstore
>>    1/ 5 bluestore
>>    1/ 5 bluefs
>>    1/ 3 bdev
>>    1/ 5 kstore
>>    4/ 5 rocksdb
>>    4/ 5 leveldb
>>    1/ 5 kinetic
>>    1/ 5 fuse
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent     10000
>>   max_new         1000
>>   log_file /var/log/ceph/ceph-osd.3.log
>> --- end dump of recent events ---
>>
>>
>> On 25/10/17 00:42, Christian Wuerdig wrote:
>>
>> From which version of ceph to which other version of ceph did you
>> upgrade? Can you provide logs from crashing OSDs? The degraded object
>> percentage being larger than 100% has been reported before
>> (https://www.spinics.net/lists/ceph-users/msg39519.html) and looks
>> like it's been fixed a week or so ago:
>> http://tracker.ceph.com/issues/21803
>>
>> On Mon, Oct 23, 2017 at 5:10 AM, Gonzalo Aguilar Delgado
>> <[email protected]> wrote:
>>
>> Hello,
>>
>> Since we upgraded the ceph cluster we have been facing a lot of problems,
>> most of them due to OSDs crashing. What can cause this?
>>
>>
>> This morning I woke up with this message:
>>
>>
>> root@red-compute:~# ceph -w
>>     cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
>>      health HEALTH_ERR
>>             1 pgs are stuck inactive for more than 300 seconds
>>             7 pgs inconsistent
>>             1 pgs stale
>>             1 pgs stuck stale
>>             recovery 20266198323167232/287940 objects degraded
>> (7038340738753.641%)
>>             37154696925806626 scrub errors
>>             too many PGs per OSD (305 > max 300)
>>      monmap e12: 2 mons at
>> {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
>>             election epoch 4986, quorum 0,1 red-compute,blue-compute
>>       fsmap e913: 1/1/1 up {0=blue-compute=up:active}
>>      osdmap e8096: 5 osds: 5 up, 5 in
>>             flags require_jewel_osds
>>       pgmap v68755349: 764 pgs, 6 pools, 558 GB data, 140 kobjects
>>             1119 GB used, 3060 GB / 4179 GB avail
>>             20266198323167232/287940 objects degraded (7038340738753.641%)
>>                  756 active+clean
>>                    7 active+clean+inconsistent
>>                    1 stale+active+clean
>>   client io 1630 B/s rd, 552 kB/s wr, 0 op/s rd, 64 op/s wr
>>
>> 2017-10-22 18:10:13.000812 mon.0 [INF] pgmap v68755348: 764 pgs: 7
>> active+clean+inconsistent, 756 active+clean, 1 stale+active+clean; 558 GB
>> data, 1119 GB used, 3060 GB / 4179 GB avail; 1641 B/s rd, 229 kB/s wr, 39
>> op/s; 20266198323167232/287940 objects degraded (7038340738753.641%)
>>
>>
>>
>>
>>
>

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
