Hi all,

I have a production cluster on which I recently purged all snapshots.

Now, on a set of OSDs, I'm hitting an assert like the one below whenever they backfill:

    -4> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] exit Started/Primary/Active/WaitRemoteBackfillReserved 0.244929 1 0.000064
    -3> 2019-08-13 00:25:14.577 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] enter Started/Primary/Active/Backfilling
    -2> 2019-08-13 00:25:14.653 7ff4637b1700  5 osd.99 pg_epoch: 206049 pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f 206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046 pi=[205889,206046)/1 rops=1 crt=206047'25372641 lcod 206047'25372640 mlcod 206047'25372640 active+undersized+remapped+backfilling mbc={} ps=80] backfill_pos is 0:b74d67be:::rbd_data.dae7bc6b8b4567.000000000000b4b8:head
    -1> 2019-08-13 00:25:14.757 7ff4637b1700 -1 /root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7ff4637b1700 time 2019-08-13 00:25:14.759270
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: 5263: FAILED ceph_assert(clone_overlap.count(clone))

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55989e4a6450]
 2: (()+0x517628) [0x55989e4a6628]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x55989e7b2197]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 12: (()+0x7fa3) [0x7ff47f718fa3]
 13: (clone()+0x3f) [0x7ff47f2c84cf]

     0> 2019-08-13 00:25:14.761 7ff4637b1700 -1 *** Caught signal (Aborted) **
 in thread 7ff4637b1700 thread_name:tp_osd_tp

 ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable)
 1: (()+0x12730) [0x7ff47f723730]
 2: (gsignal()+0x10b) [0x7ff47f2067bb]
 3: (abort()+0x121) [0x7ff47f1f1535]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x55989e4a64a1]
 5: (()+0x517628) [0x55989e4a6628]
 6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
 7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x55989e7b2197]
 8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0xfdc) [0x55989e7e059c]
 9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x110b) [0x55989e7e468b]
 10: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x302) [0x55989e639192]
 11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) [0x55989ec2ba74]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
 15: (()+0x7fa3) [0x7ff47f718fa3]
 16: (clone()+0x3f) [0x7ff47f2c84cf]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

So the failure that keeps repeating is:

    FAILED ceph_assert(clone_overlap.count(clone))
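
For what it's worth, my reading of the 14.2.1 source is that SnapSet::get_clone_bytes() computes the per-clone byte count for PG stats by looking the clone up in the SnapSet's clone_size and clone_overlap maps, and asserts that both entries exist. Roughly this (paraphrased from src/osd/osd_types.cc, so treat it as a sketch rather than the exact code):

    uint64_t SnapSet::get_clone_bytes(snapid_t clone) const
    {
      // Size the clone was recorded with when the snap was taken.
      ceph_assert(clone_size.count(clone));
      uint64_t size = clone_size.find(clone)->second;
      // Extents the clone shares with the next object; this is the
      // assert that fires for us -- the clone is still listed in the
      // SnapSet, but it has no clone_overlap entry.
      ceph_assert(clone_overlap.count(clone));
      const interval_set<uint64_t> &overlap = clone_overlap.find(clone)->second;
      // Bytes unique to this clone = recorded size minus shared extents.
      return size - overlap.size();
    }

So it looks like the head object's on-disk SnapSet still lists at least one clone (presumably left over from the purged snapshots) whose overlap entry is gone, and backfill trips the assert when it tries to account for that object's stats.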

If possible I'd like to 'nuke' this from the OSD, as there are no snaps active any more; however, I would love some advice on the best way to go about this.

Best regards,
Kevin Myers

