Hi all,
I have a production cluster on which I recently purged all snapshots.
Now, on a set of OSDs, I'm hitting an assert like the one below whenever they backfill:
-4> 2019-08-13 00:25:14.577 7ff4637b1700 5 osd.99 pg_epoch: 206049
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les
=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046
pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod
206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] exit
Started
/Primary/Active/WaitRemoteBackfillReserved 0.244929 1 0.000064
-3> 2019-08-13 00:25:14.577 7ff4637b1700 5 osd.99 pg_epoch: 206049
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641] local-lis/les
=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046
pi=[205889,206046)/1 crt=206047'25372641 lcod 206047'25372640 mlcod
206047'25372640 active+undersized+remapped+backfill_wait mbc={} ps=80] enter
Started/Primary/Active/Backfilling
-2> 2019-08-13 00:25:14.653 7ff4637b1700 5 osd.99 pg_epoch: 206049
pg[0.12ed( v 206047'25372641 (199518'25369560,206047'25372641]
local-lis/les=206046/206047 n=1746 ec=117322/1362 lis/c 206046/193496 les/c/f
206047/206028/0 206045/206046/206045) [99,76]/[99] backfill=[76] r=0 lpr=206046
pi=[205889,206046)/1 rops=1 crt=206047'25372641 lcod 206047'25372640 mlcod
206047'25372640 active+undersized+remapped+backfilling mbc={} ps=80]
backfill_pos is 0:b74d67be:::rbd_data.dae7bc6b8b4567.000000000000b4b8:head
-1> 2019-08-13 00:25:14.757 7ff4637b1700 -1
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: In function 'uint64_t
SnapSet::get_clone_bytes(snapid_t) const' thread 7ff4637b1700 time 2019-08-13
00:25:14.759270
/root/sources/pve/ceph/ceph-14.2.1/src/osd/osd_types.cc: 5263: FAILED
ceph_assert(clone_overlap.count(clone))
ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus
(stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x152) [0x55989e4a6450]
2: (()+0x517628) [0x55989e4a6628]
3: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x55989e7b2197]
5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&,
bool*)+0xfdc) [0x55989e7e059c]
6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&,
unsigned long*)+0x110b) [0x55989e7e468b]
7: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x55989e639192]
8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7d7)
[0x55989e6544d7]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x55989ec2ba74]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
12: (()+0x7fa3) [0x7ff47f718fa3]
13: (clone()+0x3f) [0x7ff47f2c84cf]
0> 2019-08-13 00:25:14.761 7ff4637b1700 -1 *** Caught signal (Aborted) **
in thread 7ff4637b1700 thread_name:tp_osd_tp
ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus
(stable)
1: (()+0x12730) [0x7ff47f723730]
2: (gsignal()+0x10b) [0x7ff47f2067bb]
3: (abort()+0x121) [0x7ff47f1f1535]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a3) [0x55989e4a64a1]
5: (()+0x517628) [0x55989e4a6628]
6: (SnapSet::get_clone_bytes(snapid_t) const+0xc2) [0x55989e880d62]
7: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>,
pg_stat_t*)+0x297) [0x55989e7b2197]
8: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&,
bool*)+0xfdc) [0x55989e7e059c]
9: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&,
unsigned long*)+0x110b) [0x55989e7e468b]
10: (OSD::do_recovery(PG*, unsigned int, unsigned long,
ThreadPool::TPHandle&)+0x302) [0x55989e639192]
11: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&,
ThreadPool::TPHandle&)+0x19) [0x55989e8d15d9]
12: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x7d7) [0x55989e6544d7]
13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4)
[0x55989ec2ba74]
14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55989ec2e470]
15: (()+0x7fa3) [0x7ff47f718fa3]
16: (clone()+0x3f) [0x7ff47f2c84cf]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
FAILED ceph_assert(clone_overlap.count(clone))
If possible I'd like to 'nuke' this object from the OSD, since there are no
snapshots active any more, but I would love some advice on the best way to go
about it.

Best regards,
Kevin Myers
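For reference, the kind of approach I have been considering (entirely unverified,
so please correct me) is ceph-objectstore-tool's remove-clone-metadata operation
against the object named in the backfill_pos line, with the OSD stopped. The
clone id below is a placeholder I'd still need to discover from the object's
snapset; paths and IDs are taken from the log above:

```shell
# Sketch only -- OSD must be stopped before using ceph-objectstore-tool.
# Values below come from the crash log; the clone id is NOT known yet and
# would have to be read out of the object's snapset first.
OSD_PATH=/var/lib/ceph/osd/ceph-99
PGID=0.12ed
OBJ=rbd_data.dae7bc6b8b4567.000000000000b4b8

# 1) dump the object to inspect its snapset and find the stale clone:
#    ceph-objectstore-tool --data-path "$OSD_PATH" --pgid "$PGID" "$OBJ" dump
# 2) drop the stale clone metadata (clone id is hypothetical here):
#    ceph-objectstore-tool --data-path "$OSD_PATH" --pgid "$PGID" \
#        "$OBJ" remove-clone-metadata <cloneid>
echo "would operate on $PGID/$OBJ under $OSD_PATH"
```

Does that sound like the right tool for this, or is there a cleaner way?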
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com