Hi Brad,

I'll do that later today.

Thanks,

George
________________________________________
From: Brad Hubbard [bhubb...@redhat.com]
Sent: 13 February 2017 22:03
To: Vasilakakos, George (STFC,RAL,SC); Ceph Users
Subject: Re: [ceph-users] PG stuck peering after host reboot

I'd suggest creating a tracker and uploading a full debug log from the
primary so we can look at this in more detail.
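[Editor's note: for reference, raising debug levels on the primary (osd.595 in this thread) is usually the first step for such a capture. Since injectargs hangs on this daemon, the settings would go into ceph.conf on its host before a restart; this sketch writes them to a local file purely for illustration — merge into /etc/ceph/ceph.conf yourself.]

```shell
# Hedged sketch: debug settings for the stuck primary's ceph.conf section.
# After restarting osd.595 with these in place, the next peering attempt is
# logged in full and /var/log/ceph/ceph-osd.595.log can go on the tracker.
cat > debug-osd595.conf <<'EOF'
[osd.595]
debug osd = 20
debug ms = 1
EOF

cat debug-osd595.conf
```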

On Mon, Feb 13, 2017 at 9:11 PM,  <george.vasilaka...@stfc.ac.uk> wrote:
> Hi Brad,
>
> I could not tell you that as `ceph pg 1.323 query` never completes, it just 
> hangs there.
>
> On 11/02/2017, 00:40, "Brad Hubbard" <bhubb...@redhat.com> wrote:
>
>     On Thu, Feb 9, 2017 at 3:36 AM,  <george.vasilaka...@stfc.ac.uk> wrote:
>     > Hi Corentin,
>     >
>     > I've tried that; the primary hangs when trying to injectargs, so I set
>     > the option in the config file and restarted all OSDs in the PG. It came
>     > up with:
>     >
>     > pg 1.323 is remapped+peering, acting
>     > [595,1391,2147483647,127,937,362,267,320,7,634,716]
>     >
>     > Still can't query the PG, no error messages in the logs of osd.240.
>     > The logs on osd.595 and osd.7 still fill up with the same messages.
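[Editor's note: since `ceph pg 1.323 query` tends to hang here, one way to work with it is to bound it with `timeout`, save the dump, and pull the blocking reason out of the JSON afterwards. A sketch; the sample dump below is hypothetical, mirroring the fragment quoted later in the thread.]

```shell
# On the cluster this would be something like:
#   timeout 60 ceph pg 1.323 query > query.json
# Here we fake a minimal dump so the extraction step can be shown end to end.
cat > query.json <<'EOF'
{
  "recovery_state": [
    {
      "name": "Started/Primary/Peering",
      "peering_blocked_by_detail": [
        { "detail": "peering_blocked_by_history_les_bound" }
      ]
    }
  ]
}
EOF

# Walk recovery_state and print any blocking detail found.
python -c '
import json
q = json.load(open("query.json"))
for state in q["recovery_state"]:
    for d in state.get("peering_blocked_by_detail", []):
        print(d["detail"])
'
```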
>
>     So what does "peering_blocked_by_detail" show in that case since it
>     can no longer show "peering_blocked_by_history_les_bound"?
>
>     >
>     > Regards,
>     >
>     > George
>     > ________________________________
>     > From: Corentin Bonneton [l...@titin.fr]
>     > Sent: 08 February 2017 16:31
>     > To: Vasilakakos, George (STFC,RAL,SC)
>     > Cc: ceph-users@lists.ceph.com
>     > Subject: Re: [ceph-users] PG stuck peering after host reboot
>     >
>     > Hello,
>     >
>     > I've had this case before; I applied the parameter
>     > (osd_find_best_info_ignore_history_les) to all the OSDs that were
>     > reporting blocked queries.
>     >
>     > --
>     > Regards,
>     > CEO FEELB | Corentin BONNETON
>     > cont...@feelb.io<mailto:cont...@feelb.io>
>     >
>     > On 8 Feb 2017, at 17:17,
>     > george.vasilaka...@stfc.ac.uk<mailto:george.vasilaka...@stfc.ac.uk> wrote:
>     >
>     > Hi Ceph folks,
>     >
>     > I have a cluster running Jewel 10.2.5 using a mix of EC and replicated
>     > pools.
>     >
>     > After rebooting a host last night, one PG refuses to complete peering:
>     >
>     > pg 1.323 is stuck inactive for 73352.498493, current state peering,
>     > last acting [595,1391,240,127,937,362,267,320,7,634,716]
>     >
>     > Restarting OSDs or hosts does nothing to help, or sometimes results in
>     > things like this:
>     >
>     > pg 1.323 is remapped+peering, acting
>     > [2147483647,1391,240,127,937,362,267,320,7,634,716]
>     >
>     >
>     > The host that was rebooted is home to osd.7 (rank 8). If I go onto it
>     > to look at the logs for osd.7, this is what I see:
>     >
>     > $ tail -f /var/log/ceph/ceph-osd.7.log
>     > 2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect
>     >
>     > I'm assuming that in IP1:port1/PID1 >> IP2:port2/PID2 the >> indicates
>     > the direction of communication. I've traced these to osd.7 (rank 8 in
>     > the stuck PG) reaching out to osd.595 (the primary in the stuck PG).
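[Editor's note: that reading of the fault lines can be checked mechanically; a small sketch that pulls the source >> destination pair out of a captured messenger fault line (addresses redacted exactly as in the post).]

```shell
# Save one of the fault lines from ceph-osd.7.log (verbatim from above).
cat > osd7-fault.log <<'EOF'
2017-02-08 15:41:00.445247 7f5fcc2bd700  0 -- XXX.XXX.XXX.172:6905/20510 >> XXX.XXX.XXX.192:6921/55371 pipe(0x7f6074a0b400 sd=34 :42828 s=2 pgs=319 cs=471 l=0 c=0x7f6070086700).fault, initiating reconnect
EOF

# Whitespace-split fields 6 and 8 are the local and remote addr:port/pid
# endpoints of the failing pipe.
awk '/fault/ {print $6, ">>", $8}' osd7-fault.log
```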
>     >
>     > Meanwhile, looking at the logs of osd.595 I see this:
>     >
>     > $ tail -f /var/log/ceph/ceph-osd.595.log
>     > 2017-02-08 15:41:15.760708 7f1765673700  0 -- XXX.XXX.XXX.192:6921/55371 >> XXX.XXX.XXX.172:6905/20510 pipe(0x7f17b2911400 sd=101 :6921 s=0 pgs=0 cs=0 l=0 c=0x7f17b7beaf00).accept connect_seq 478 vs existing 477 state standby
>     > 2017-02-08 15:41:20.768844 7f1765673700  0 bad crc in front 1941070384 != exp 3786596716
>     >
>     > which again shows osd.595 reaching out to osd.7; from what I could
>     > gather, the bad CRC error comes from the messaging layer.
>     >
>     > Google searching has yielded nothing particularly useful on how to get
>     > this unstuck.
>     >
>     > ceph pg 1.323 query seems to hang forever, but it completed once last
>     > night and I noticed this:
>     >
>     >            "peering_blocked_by_detail": [
>     >                {
>     >                    "detail": "peering_blocked_by_history_les_bound"
>     >                }
>     >
>     > We have seen this before; it was cleared by setting
>     > osd_find_best_info_ignore_history_les to true for the first two OSDs on
>     > the stuck PGs (that was on a 3-replica pool). It hasn't worked in this
>     > case, and I suspect the option needs to be set either on a majority of
>     > the OSDs or on at least k of them, so that their data can be used while
>     > the history is ignored.
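[Editor's note: if the option does have to go onto every member of the acting set, generating the ceph.conf overrides can be scripted. A hypothetical sketch using the acting set reported for this PG; how the sections get distributed and applied, and whether k OSDs suffice, is exactly the open question in the thread.]

```shell
# Acting set of pg 1.323 as reported above (osd.240 at rank 2).
acting="595 1391 240 127 937 362 267 320 7 634 716"

# Emit an [osd.N] section carrying the flag for each member; the resulting
# file would be merged into ceph.conf on each OSD's host before a restart.
for id in $acting; do
  printf '[osd.%s]\nosd find best info ignore history les = true\n\n' "$id"
done > stuck-pg-overrides.conf

# One section per acting-set member (11 for this EC pool).
grep -c '^\[osd\.' stuck-pg-overrides.conf
```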
>     >
>     > We would really appreciate any guidance and/or help the community can
>     > offer!
>     >
>     > _______________________________________________
>     > ceph-users mailing list
>     > ceph-users@lists.ceph.com
>     > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>     --
>     Cheers,
>     Brad
--
Cheers,
Brad
