Re: [ceph-users] PG Stuck EC Pool

Ashley Merrick Wed, 12 Jul 2017 00:27:31 -0700

Is this planned to be merged into Luminous at some point?

,Ashley


From: Gregory Farnum [mailto:[email protected]]
Sent: Tuesday, 6 June 2017 2:24 AM
To: Ashley Merrick <[email protected]>; [email protected]
Cc: David Zafman <[email protected]>
Subject: Re: [ceph-users] PG Stuck EC Pool

It looks to me like this is related to http://tracker.ceph.com/issues/18162.

You might see if they came up with good resolution steps, and it looks like 
David is working on it in master but hasn't finished it yet.
-Greg

On Sat, Jun 3, 2017 at 2:47 AM Ashley Merrick <[email protected]> wrote:
Attaching logs with logging set to level 20.

After repeated attempts at removing nobackfill, I have got it down to:


            recovery 31892/272325586 objects degraded (0.012%)
            recovery 2/272325586 objects misplaced (0.000%)

However, any further attempt after removing nobackfill just causes an instant 
crash on OSDs 83 & 84. At this point I suspect there is some corruption on the 
remaining 11 OSDs of the PG. The errors don't directly say that, but the crash 
always ends with:

-1 *** Caught signal (Aborted) ** in thread 7f716e862700 
thread_name:tp_osd_recov

,Ashley

From: ceph-users [mailto:[email protected]] On Behalf Of Ashley Merrick
Sent: 03 June 2017 17:14
To: [email protected]
Subject: Re: [ceph-users] PG Stuck EC Pool



I have now done some further testing and am seeing these errors on 84 / 83, the 
OSDs that crash while backfilling to 10 and 11:

   -60> 2017-06-03 10:08:56.651768 7f6f76714700  1 -- 
172.16.3.14:6823/2694 <== osd.3 
172.16.2.101:0/25361 10 ==== osd_ping(ping e71688 
stamp 2017-06-03 10:08:56.652035) v2 ==== 47+0+0 (1097709006 0 0) 
0x5569ea88d400 con 0x5569e900e300
   -59> 2017-06-03 10:08:56.651804 7f6f76714700  1 -- 
172.16.3.14:6823/2694 --> 
172.16.2.101:0/25361 -- osd_ping(ping_reply e71688 
stamp 2017-06-03 10:08:56.652035) v2 -- ?+0 0x5569e985fc00 con 0x5569e900e300
    -6> 2017-06-03 10:08:56.937156 7f6f5ee4d700  1 -- 
172.16.3.14:6822/2694 <== osd.53 
172.16.3.7:6816/15230 13 ==== 
MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0)) v1 ==== 
148+0+0 (2355392791 0 0) 0x5569e8b22080 con 0x5569e9538f00
    -5> 2017-06-03 10:08:56.937193 7f6f5ee4d700  5 -- op tracker -- seq: 2409, 
time: 2017-06-03 10:08:56.937193, event: queued_for_pg, op: 
MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
    -4> 2017-06-03 10:08:56.937241 7f6f8ef8a700  5 -- op tracker -- seq: 2409, 
time: 2017-06-03 10:08:56.937240, event: reached_pg, op: 
MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
    -3> 2017-06-03 10:08:56.937266 7f6f8ef8a700  0 osd.83 pg_epoch: 71688 
pg[6.14s3( v 71685'35512 (68694'30812,71685'35512] local-les=71688 n=15928 
ec=31534 les/c/f 71688/69510/67943 71687/71687/71687) 
[11,10,2147483647,83,22,26,69,72,53,59,8,4,46]/[2147483647,2147483647,2147483647,83,22,26,69,72,53,59,8,4,46]
 r=3 lpr=71687 pi=47065-71686/711 rops=1 bft=10(1),11(0) crt=71629'35509 mlcod 
0'0 active+undersized+degraded+remapped+inconsistent+backfilling NIBBLEWISE] 
failed_push 6:28170432:::rbd_data.e3d8852ae8944a.0000000000047d28:head from 
shard 53(8), reps on  unfound? 0
    -2> 2017-06-03 10:08:56.937346 7f6f8ef8a700  5 -- op tracker -- seq: 2409, 
time: 2017-06-03 10:08:56.937345, event: done, op: MOSDECSubOpReadReply(6.14s3 
71688 ECSubReadReply(tid=83, attrs_read=0))
    -1> 2017-06-03 10:08:56.937351 7f6f89f80700 -1 osd.83 pg_epoch: 71688 
pg[6.14s3( v 71685'35512 (68694'30812,71685'35512] local-les=71688 n=15928 
ec=31534 les/c/f 71688/69510/67943 71687/71687/71687) 
[11,10,2147483647,83,22,26,69,72,53,59,8,4,46]/[2147483647,2147483647,2147483647,83,22,26,69,72,53,59,8,4,46]
 r=3 lpr=71687 pi=47065-71686/711 bft=10(1),11(0) crt=71629'35509 mlcod 0'0 
active+undersized+degraded+remapped+inconsistent+backfilling NIBBLEWISE] 
recover_replicas: object added to missing set for backfill, but is not in 
recovering, error!
   -42> 2017-06-03 10:08:56.968433 7f6f5f04f700  1 -- 
172.16.2.114:6822/2694 <== client.22857445 
172.16.2.212:0/2238053329 56 ==== 
osd_op(client.22857445.1:759236283 2.e732321d 
rbd_data.61b4c6238e1f29.000000000001ea27 [set-alloc-hint object_size 4194304 
write_size 4194304,write 126976~45056] snapc 0=[] ondisk+write e71688) v4 ==== 
217+0+45056 (2626314663 0 3883338397) 0x5569ea886b00 con 0x5569ea99c880

From: Ashley Merrick
Sent: 03 June 2017 14:27
To: '[email protected]' <[email protected]>
Subject: RE: PG Stuck EC Pool

From this extract from pg query:

"up": [
                    11,
                    10,
                    84,
                    83,
                    22,
                    26,
                    69,
                    72,
                    53,
                    59,
                    8,
                    4,
                    46
                ],
                "acting": [
                    2147483647,
                    2147483647,
                    84,
                    83,
                    22,
                    26,
                    69,
                    72,
                    53,
                    59,
                    8,
                    4,
                    46

I am wondering if there is an issue on OSDs 11 and 10 causing the current 
acting primary ("acting_primary": 84) to crash.

But I can't see anything that could be causing it.
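
For anyone wanting to repeat this comparison on a larger pg query, here is a minimal Python sketch of it. It assumes the "up" and "acting" lists have already been copied out of the pg query output, and the helper name backfill_candidates is made up for illustration. 2147483647 is 0x7fffffff, the placeholder Ceph prints for a shard with no OSD assigned.

```python
# 0x7fffffff: placeholder Ceph prints when a shard has no OSD assigned.
CRUSH_ITEM_NONE = 2147483647


def backfill_candidates(up, acting):
    """Return {shard_index: up_osd} for shards whose up OSD is not yet acting.

    Hypothetical helper: it only compares the two lists position by position.
    """
    return {
        shard: up_osd
        for shard, (up_osd, acting_osd) in enumerate(zip(up, acting))
        if up_osd != acting_osd and up_osd != CRUSH_ITEM_NONE
    }


# Values taken from the pg query extract above.
up = [11, 10, 84, 83, 22, 26, 69, 72, 53, 59, 8, 4, 46]
acting = [CRUSH_ITEM_NONE, CRUSH_ITEM_NONE, 84, 83, 22, 26, 69, 72, 53, 59, 8, 4, 46]

targets = backfill_candidates(up, acting)
print(targets)  # {0: 11, 1: 10}
```

Shards 0 and 1 map to OSDs 11 and 10, which agrees with the "bft=10(1),11(0)" field in the OSD log lines quoted in this thread.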

,Ashley

From: Ashley Merrick
Sent: 01 June 2017 23:39
To: [email protected]
Subject: RE: PG Stuck EC Pool

I have attached the full pg query for the affected PG in case it shows anything 
of interest.

Thanks

From: ceph-users [mailto:[email protected]] On Behalf Of Ashley 
Merrick
Sent: 01 June 2017 17:19
To: [email protected]
Subject: [ceph-users] PG Stuck EC Pool




I have a PG which is stuck in this state (it is in an EC pool with K=10, M=3):





pg 6.14 is active+undersized+degraded+remapped+inconsistent+backfilling, acting 
[2147483647,2147483647,84,83,22,26,69,72,53,59,8,4,46]
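
As a rough sanity check on the state above, the shard arithmetic for K=10, M=3 can be sketched in Python. This is only an illustration: the helper name shard_summary is hypothetical, and it simply counts entries in the acting set that equal 2147483647 (0x7fffffff, shown when a shard has no OSD). It says nothing about whether the surviving shards are themselves consistent.

```python
# 0x7fffffff: placeholder Ceph prints when a shard has no OSD assigned.
CRUSH_ITEM_NONE = 2147483647


def shard_summary(acting, k, m):
    """Count present and missing shards of an EC PG (hypothetical helper)."""
    if len(acting) != k + m:
        raise ValueError("acting set should have k + m entries")
    missing = [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]
    present = len(acting) - len(missing)
    # 'spare' is how many more shards could be lost while k remain available.
    return {"missing_shards": missing, "present": present, "spare": present - k}


# Acting set from the health output above; K and M from the pool description.
acting = [CRUSH_ITEM_NONE, CRUSH_ITEM_NONE, 84, 83, 22, 26, 69, 72, 53, 59, 8, 4, 46]
summary = shard_summary(acting, k=10, m=3)
print(summary)  # {'missing_shards': [0, 1], 'present': 11, 'spare': 1}
```

With 11 of 13 shards present, the PG still has the 10 shards needed to reconstruct objects, but only one shard of margin, which is why losing 83 or 84 during recovery is a concern.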



I currently have norecover set. If I unset norecover, both OSD 83 and 84 start 
to flap, going up and down, and I see the following in the OSD logs:



*****
    -5> 2017-06-01 10:08:29.658593 7f430ec97700  1 -- 
172.16.3.14:6806/5204 <== osd.17 
172.16.3.3:6806/2006016 57 ==== 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1)) v1 ==== 67+0+0 (245959818 0 0) 0x563c9db7be00 con 
0x563c9cfca480
    -4> 2017-06-01 10:08:29.658620 7f430ec97700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658620, event: queued_for_pg, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
    -3> 2017-06-01 10:08:29.658649 7f4319e11700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658649, event: reached_pg, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
    -2> 2017-06-01 10:08:29.658661 7f4319e11700  5 -- op tracker -- seq: 2367, 
time: 2017-06-01 10:08:29.658660, event: done, op: 
MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, 
committed=0, applied=1))
    -1> 2017-06-01 10:08:29.663107 7f43320ec700  5 -- op tracker -- seq: 2317, 
time: 2017-06-01 10:08:29.663107, event: sub_op_applied, op: 
osd_op(osd.79.66617:8675008 6.82058b1a rbd_data.e5208a238e1f29.0000000000025f3e 
[copy-from ver 4678410] snapc 0=[] 
ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e71513)
     0> 2017-06-01 10:08:29.663474 7f4319610700 -1 *** Caught signal (Aborted) 
**
 in thread 7f4319610700 thread_name:tp_osd_recov

 ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
 1: (()+0x9564a7) [0x563c6a6f24a7]
 2: (()+0xf890) [0x7f4342308890]
 3: (gsignal()+0x37) [0x7f434034f067]
 4: (abort()+0x148) [0x7f4340350448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x256) [0x563c6a7f83d6]
 6: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x62f) 
[0x563c6a2850ff]
 7: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xa8a) 
[0x563c6a2b878a]
 8: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x36d) [0x563c6a131bbd]
 9: (ThreadPool::WorkQueue<PG>::_void_process(void*, 
ThreadPool::TPHandle&)+0x1d) [0x563c6a17c88d]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa9f) [0x563c6a7e8e3f]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x563c6a7e9d70]
 12: (()+0x8064) [0x7f4342301064]
 13: (clone()+0x6d) [0x7f434040262d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.
*****




What should my next steps be?



Thanks!
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
