Why did you remove osd.7?

Something else appears to be wrong.  With all 11 OSDs up, you shouldn't
have any PGs stuck in stale or peering.
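For what it's worth, the per-state counts in the pgmap section of your 'ceph -s' paste do add up to the 846 PG total, so no PGs are unaccounted for. A quick check with awk (state lines copied from your status output):

```shell
# Sum the PG counts per state from the pgmap section of 'ceph -s'.
# The lines below are copied verbatim from the status paste in this thread.
awk '{n += $1} END {print n}' <<'EOF'
63 stale+active+clean
1 down+incomplete
521 active+clean
251 incomplete
10 stale+peering
EOF
# -> 846, matching "846 pgs" in pgmap v554251
```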


How badly are the clocks skewed between your nodes?  If it's bad enough, it
can cause communication problems between nodes.  Ceph will complain if the
clocks differ by more than 50ms (the default mon_clock_drift_allowed).
It's best to run ntpd on all nodes, pointed at the same NTP servers.

I'm thinking that cleaning up the clock skew will fix most of your issues.


If that does fix the issue, you can try bringing osd.7 back in.  Don't
reformat it; just deploy it as you normally would.  The CRUSH map will go
back to the way it was before you removed osd.7.  Ceph will start to
backfill+remap data onto the "new" OSD, and see that most of it is already
there.  It should recover relatively quickly... I think.
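If you end up re-adding it by hand instead of via your usual deploy tooling, the sequence is roughly the one below. This is a dry-run sketch that only prints the commands rather than executing them; it assumes osd.7's data directory is still intact on its host, and the weight 3.35 and host rack2-storage-5 are taken from the osd tree in this thread, so double-check them before running anything:

```shell
# Dry-run sketch: print, rather than execute, the commands to bring a
# removed OSD back without reformatting.  'ceph osd create' should hand
# back the lowest free id (7 here); the auth caps are the standard ones
# for a manually added OSD.
readd_osd() {
  local id=$1 host=$2 weight=$3
  cat <<EOF
ceph osd create
ceph auth add osd.$id osd 'allow *' mon 'allow profile osd' -i /var/lib/ceph/osd/ceph-$id/keyring
ceph osd crush add osd.$id $weight host=$host
EOF
}
readd_osd 7 rack2-storage-5 3.35
```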


On Fri, Dec 19, 2014 at 10:28 AM, Robert LeBlanc <[email protected]>
wrote:
>
> I'm still pretty new at troubleshooting Ceph and since no one has
> responded yet I'll give a stab.
>
> What is the size of your pool?
> 'ceph osd pool get <pool name> size'
>
> Based on the number of incomplete PGs, it seems like it was '1'. I
> understand that if you are able to bring osd 7 back in, it would clear up.
> I'm just not seeing a secondary OSD for that PG.
>
> Disclaimer: I could be totally wrong.
>
> Robert LeBlanc
>
> On Thu, Dec 18, 2014 at 11:41 PM, Mallikarjun Biradar <
> [email protected]> wrote:
>
>> Hi all,
>>
>> I had 12 OSDs in my cluster across 2 OSD nodes. One of the OSDs was in the
>> down state, and I removed that OSD from the cluster by removing the CRUSH
>> entry for that OSD.
>>
>> Now, with 11 OSDs, the cluster started rebalancing. After some time, the
>> cluster status was:
>>
>> ems@rack6-client-5:~$ sudo ceph -s
>>     cluster eb5452f4-5ce9-4b97-9bfd-2a34716855f1
>>      health HEALTH_WARN 1 pgs down; 252 pgs incomplete; 10 pgs peering;
>> 73 pgs stale; 262 pgs stuck inactive; 73 pgs stuck stale; 262 pgs stuck
>> unclean; clock skew detected on mon.rack6-client-5, mon.rack6-client-6
>>      monmap e1: 3 mons at {rack6-client-4=
>> 10.242.43.105:6789/0,rack6-client-5=10.242.43.106:6789/0,rack6-client-6=10.242.43.107:6789/0},
>> election epoch 12, quorum 0,1,2 rack6-client-4,rack6-client-5,rack6-client-6
>>      osdmap e2648: 11 osds: 11 up, 11 in
>>       pgmap v554251: 846 pgs, 3 pools, 4383 GB data, 1095 kobjects
>>             11668 GB used, 26048 GB / 37717 GB avail
>>                   63 stale+active+clean
>>                    1 down+incomplete
>>                  521 active+clean
>>                  251 incomplete
>>                   10 stale+peering
>> ems@rack6-client-5:~$
>>
>>
>> To fix this, I can't run "ceph osd lost <osd.id>" to drop the PG that is
>> stuck in the down state, since the OSD has already been removed from the
>> cluster.
>>
>> ems@rack6-client-4:~$ sudo ceph pg dump all | grep down
>> dumped all in format plain
>> 1.38    1548    0       0       0       0       6492782592      3001
>>  3001    down+incomplete 2014-12-18 15:58:29.681708      1118'508438
>> 2648:1073892    [6,3,1]     6       [6,3,1] 6       76'437184
>> 2014-12-16 12:38:35.322835      76'437184       2014-12-16 12:38:35.322835
>> ems@rack6-client-4:~$
>>
>> ems@rack6-client-4:~$ sudo ceph pg 1.38 query
>> .............
>> "recovery_state": [
>>         { "name": "Started\/Primary\/Peering",
>>           "enter_time": "2014-12-18 15:58:29.681666",
>>           "past_intervals": [
>>                 { "first": 1109,
>>                   "last": 1118,
>>                   "maybe_went_rw": 1,
>> ...................
>> ...................
>> "down_osds_we_would_probe": [
>>                 7],
>>           "peering_blocked_by": []},
>> ...................
>> ...................
>>
>> ems@rack6-client-4:~$ sudo ceph osd tree
>> # id    weight  type name       up/down reweight
>> -1      36.85   root default
>> -2      20.1            host rack2-storage-1
>> 0       3.35                    osd.0   up      1
>> 1       3.35                    osd.1   up      1
>> 2       3.35                    osd.2   up      1
>> 3       3.35                    osd.3   up      1
>> 4       3.35                    osd.4   up      1
>> 5       3.35                    osd.5   up      1
>> -3      16.75           host rack2-storage-5
>> 6       3.35                    osd.6   up      1
>> 8       3.35                    osd.8   up      1
>> 9       3.35                    osd.9   up      1
>> 10      3.35                    osd.10  up      1
>> 11      3.35                    osd.11  up      1
>> ems@rack6-client-4:~$ sudo ceph osd lost 7 --yes-i-really-mean-it
>> osd.7 is not down or doesn't exist
>> ems@rack6-client-4:~$
>>
>>
>> Can somebody suggest any other recovery steps to get out of this?
>>
>> -Thanks & Regards,
>> Mallikarjun Biradar
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com