On Tue, Dec 12, 2017 at 12:33 PM Nick Fisk <[email protected]> wrote:


> That doesn't look like an RBD object -- any idea who is
> "client.34720596.1:212637720"?

So I think these might be proxy ops from the cache tier, as there are also
blocked ops on one of the cache tier OSDs, but this time the op actually
lists the object name. Blocked op on the cache tier:

           "description": "osd_op(client.34720596.1:212637720 17.ae78c1cf
17:f3831e75:::rbd_data.15a5e20238e1f29.00000000000388ad:head [set-alloc-hint
object_size 4194304 write_size 4194304,write 2584576~16384] snapc 0=[]
RETRY=2 ondisk+retry+write+known_if_redirected e104841)",
            "initiated_at": "2017-12-12 16:25:32.435718",
            "age": 13996.681147,
            "duration": 13996.681203,
            "type_data": {
                "flag_point": "reached pg",
                "client_info": {
                    "client": "client.34720596",
                    "client_addr": "10.3.31.41:0/2600619462",
                    "tid": 212637720
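
One way to check the proxy-op theory is to compare this description with the blocked op on the base-tier OSD quoted further down the thread. The following is only a rough Python sketch: the helper names (parse_osd_op, stable_mod, pg_seed) are made up for illustration, the parsing assumes nothing beyond the layout of the two pasted descriptions, and the pg_num values are guesses because pool 0's pg_num is not given anywhere in the thread. The stable_mod helper follows my understanding of how Ceph folds an object hash into a PG seed.

```python
import re

# The two op descriptions pasted in this thread, re-joined onto one line each.
CACHE_TIER_OP = (
    "osd_op(client.34720596.1:212637720 17.ae78c1cf "
    "17:f3831e75:::rbd_data.15a5e20238e1f29.00000000000388ad:head "
    "[set-alloc-hint object_size 4194304 write_size 4194304,"
    "write 2584576~16384] snapc 0=[] RETRY=2 "
    "ondisk+retry+write+known_if_redirected e104841)"
)
BASE_TIER_OP = (
    "osd_op(client.34720596.1:212637720 0.1cf 0.ae78c1cf (undecoded) "
    "ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected "
    "e105014)"
)


def parse_osd_op(description):
    """Pull client id, tid, pool and 32-bit hash out of a description."""
    client = re.search(r"client\.(\d+)\.\d+:(\d+)", description)
    # The first whitespace-separated token shaped like "<pool>.<8 hex digits>".
    raw_pg = next(
        m
        for m in (
            re.fullmatch(r"(\d+)\.([0-9a-f]{8})", tok)
            for tok in description.split()
        )
        if m
    )
    return {
        "client": int(client.group(1)),
        "tid": int(client.group(2)),
        "pool": int(raw_pg.group(1)),
        "hash": int(raw_pg.group(2), 16),
    }


def stable_mod(x, b, bmask):
    """Fold x into the range [0, b), staying stable as b grows."""
    return x & bmask if (x & bmask) < b else x & (bmask >> 1)


def pg_seed(obj_hash, pg_num):
    """PG seed for an object hash, given a pool's pg_num."""
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return stable_mod(obj_hash, pg_num, bmask)


cache_op = parse_osd_op(CACHE_TIER_OP)
base_op = parse_osd_op(BASE_TIER_OP)

same_request = (cache_op["client"], cache_op["tid"], cache_op["hash"]) == (
    base_op["client"], base_op["tid"], base_op["hash"]
)
print("same client, tid and hash on both tiers:", same_request)

# Assumed pg_num values; the real value for pool 0 is not in the thread.
for pg_num in (512, 1024, 2048, 4096):
    print(f"pg_num={pg_num}: 0.{pg_seed(base_op['hash'], pg_num):x}")
```

If both ops carry the same client, tid and hash, that would be consistent with the base-tier op being the cache tier's write to rbd_data.15a5e20238e1f29.00000000000388ad forwarded to pool 0, and with the hash landing in PG 0.1cf for any of the assumed pg_num values.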

I'm a bit baffled at the moment about what's going on. The pg query (attached)
is not showing in the main status that it has been blocked from peering or
that there are any missing objects. I've tried restarting all the OSDs I can
see relating to the PG in case they needed a bit of a nudge.

 

Did that fix anything? I don't see anything immediately obvious but I'm not 
practiced in quickly reading that pg state output.

 

What's the output of "ceph -s"?

 

Hi Greg,

 

No, restarting the OSDs didn’t seem to help. But I did make some progress late
last night. Stopping OSD.68 unlocks the cluster and IO can progress. However,
as soon as it starts back up, 0.1cf and a couple of other PGs again get stuck
in an activating state. If I mark the OSD out, either with it up or down, then
some other PGs seem to get hit by the same problem as CRUSH moves PG mappings
around to other OSDs.

 

So there definitely seems to be some sort of weird peering issue somewhere. I
have seen a very similar issue before on this cluster where, after running the
crush reweight script to balance OSD utilization, the weight got set too low
and PGs were unable to peer. I’m not convinced this is what’s happening here,
as none of the weights have changed, but I intend to explore this further
just in case.

 

With 68 down

    pgs:     1071783/48650631 objects degraded (2.203%)
             5923 active+clean
             399  active+undersized+degraded
             7    active+clean+scrubbing+deep
             7    active+clean+remapped

 

With it up

    pgs:     0.047% pgs not active
             67271/48651279 objects degraded (0.138%)
             15602/48651279 objects misplaced (0.032%)
             6051 active+clean
             273  active+recovery_wait+degraded
             4    active+clean+scrubbing+deep
             4    active+remapped+backfill_wait
             3    activating+remapped
             1    active+recovering+degraded
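
As a sanity check on the two status snippets, the per-state counts can be added up and compared with the reported "pgs not active" figure. This is only a small Python sketch over the numbers pasted above; the dictionary and function names are made up for illustration, and it takes the active+recovering+degraded count to be 1.

```python
# PG state counts copied from the two status outputs above.
WITH_68_DOWN = {
    "active+clean": 5923,
    "active+undersized+degraded": 399,
    "active+clean+scrubbing+deep": 7,
    "active+clean+remapped": 7,
}
WITH_68_UP = {
    "active+clean": 6051,
    "active+recovery_wait+degraded": 273,
    "active+clean+scrubbing+deep": 4,
    "active+remapped+backfill_wait": 4,
    "activating+remapped": 3,
    "active+recovering+degraded": 1,  # assumed count, see note above
}


def summarize(states):
    """Return (total PGs, PGs without the 'active' flag, percent not active)."""
    total = sum(states.values())
    inactive = sum(
        count
        for state, count in states.items()
        if "active" not in state.split("+")
    )
    return total, inactive, 100.0 * inactive / total


down_total, down_inactive, down_pct = summarize(WITH_68_DOWN)
up_total, up_inactive, up_pct = summarize(WITH_68_UP)

print(f"68 down: {down_total} PGs, {down_inactive} not active ({down_pct:.3f}%)")
print(f"68 up:   {up_total} PGs, {up_inactive} not active ({up_pct:.3f}%)")
```

Both snippets should add up to the same PG total, and the three activating PGs account for the 0.047% that is not active while OSD.68 is up.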

 

PG Dump

ceph pg dump | grep activatin

dumped all

2.389 0 0 0 0 0 0 1500 1500 activating+remapped 2017-12-13 11:08:50.990526 76271'34230 106239:160310 [68,60,58,59,29,23] 68 [62,60,58,59,29,23] 62 76271'34230 2017-12-13 09:00:08.359690 76271'34230 2017-12-10 10:05:10.931366

0.1cf 3947 0 0 0 0 16472186880 1577 1577 activating+remapped 2017-12-13 11:08:50.641034 106236'7512915 106239:6176548 [34,68,8] 34 [34,8,53] 34 106138'7512682 2017-12-13 10:27:37.400613 106138'7512682 2017-12-13 10:27:37.400613

2.210 0 0 0 0 0 0 1500 1500 activating+remapped 2017-12-13 11:08:50.686193 76271'33304 106239:96797 [68,67,34,36,16,15] 68 [62,67,34,36,16,15] 62 76271'33304 2017-12-12 00:49:21.038437 76271'33304 2017-12-10 16:05:12.751425
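
To make the common factor in those three rows easier to see, the two bracketed OSD lists in each row can be compared. A minimal Python sketch follows, assuming the first bracketed list in each row is the up set and the second is the acting set (my reading of the pg dump column order, not something stated in the output itself); the helper names are made up for illustration.

```python
# Rows copied from the pg dump output above, re-joined onto one line each.
PG_DUMP_ROWS = """\
2.389 0 0 0 0 0 0 1500 1500 activating+remapped 2017-12-13 11:08:50.990526 76271'34230 106239:160310 [68,60,58,59,29,23] 68 [62,60,58,59,29,23] 62 76271'34230 2017-12-13 09:00:08.359690 76271'34230 2017-12-10 10:05:10.931366
0.1cf 3947 0 0 0 0 16472186880 1577 1577 activating+remapped 2017-12-13 11:08:50.641034 106236'7512915 106239:6176548 [34,68,8] 34 [34,8,53] 34 106138'7512682 2017-12-13 10:27:37.400613 106138'7512682 2017-12-13 10:27:37.400613
2.210 0 0 0 0 0 0 1500 1500 activating+remapped 2017-12-13 11:08:50.686193 76271'33304 106239:96797 [68,67,34,36,16,15] 68 [62,67,34,36,16,15] 62 76271'33304 2017-12-12 00:49:21.038437 76271'33304 2017-12-10 16:05:12.751425
"""


def parse_osd_list(token):
    """Turn a token such as '[34,68,8]' into [34, 68, 8]."""
    return [int(osd) for osd in token.strip("[]").split(",")]


def up_and_acting(row):
    """Return (pgid, up set, acting set) for one pg dump row."""
    tokens = row.split()
    # Assumption: first bracketed list is UP, second is ACTING.
    osd_lists = [parse_osd_list(t) for t in tokens if t.startswith("[")]
    return tokens[0], osd_lists[0], osd_lists[1]


only_in_up = {}
for row in PG_DUMP_ROWS.strip().splitlines():
    pgid, up, acting = up_and_acting(row)
    only_in_up[pgid] = set(up) - set(acting)
    only_in_acting = set(acting) - set(up)
    print(
        f"{pgid}: up-only {sorted(only_in_up[pgid])}, "
        f"acting-only {sorted(only_in_acting)}"
    )

# OSDs that appear in the up set but not the acting set of every stuck PG.
common = set.intersection(*only_in_up.values())
print("in up but not acting for every stuck PG:", sorted(common))
```

Under that column assumption, OSD 68 is the only OSD that is in the up set but missing from the acting set of all three activating PGs, which matches the observation that stopping OSD.68 lets IO progress.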

 

 


>
> On Tue, Dec 12, 2017 at 12:36 PM, Nick Fisk <[email protected]> wrote:
> > Does anyone know what this object (0.ae78c1cf) might be? It's not your
> > normal run-of-the-mill RBD object and I can't seem to find it in the
> > pool using rados --all ls . It seems to be leaving the 0.1cf PG stuck
> > in an activating+remapped state and blocking IO. Pool 0 is just a pure
> > RBD pool with a cache tier above it. There is no current mention of
> > unfound objects or any other obvious issues.
> >
> > There is some backfilling going on on another OSD which was upgraded
> > to bluestore, which was when the issue started. But I can't see any
> > link in the PG dump with the upgraded OSD. My only thought so far is to
> > wait for this backfilling to finish and then deep-scrub this PG and
> > see if that reveals anything?
> >
> > Thanks,
> > Nick
> >
> >  "description": "osd_op(client.34720596.1:212637720 0.1cf 0.ae78c1cf
> > (undecoded)
> > ondisk+retry+write+ignore_cache+ignore_overlay+known_if_redirected
> > e105014)",
> >             "initiated_at": "2017-12-12 17:10:50.030660",
> >             "age": 335.948290,
> >             "duration": 335.948383,
> >             "type_data": {
> >                 "flag_point": "delayed",
> >                 "events": [
> >                     {
> >                         "time": "2017-12-12 17:10:50.030660",
> >                         "event": "initiated"
> >                     },
> >                     {
> >                         "time": "2017-12-12 17:10:50.030692",
> >                         "event": "queued_for_pg"
> >                     },
> >                     {
> >                         "time": "2017-12-12 17:10:50.030719",
> >                         "event": "reached_pg"
> >                     },
> >                     {
> >                         "time": "2017-12-12 17:10:50.030727",
> >                         "event": "waiting for peered"
> >                     },
> >                     {
> >                         "time": "2017-12-12 17:10:50.197353",
> >                         "event": "reached_pg"
> >                     },
> >                     {
> >                         "time": "2017-12-12 17:10:50.197355",
> >                         "event": "waiting for peered"
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason

