Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num

Kevin Olbrich Thu, 17 May 2018 09:32:25 -0700

Hi!

@Paul
Thanks! I know, I read the whole topic about size 2 some months ago. But
this has not been my decision, I had to set it up like that.


In the meantime, I did a reboot of node1001 and node1002 with flag "noout"
set and now peering has finished and only 0.0x% are rebalanced.
IO is flowing again. This happened as soon as the OSD was down (not out).

This looks very much like a bug to me, doesn't it? Restarting an OSD to
"repair" crush?
Also I did query the pg but it did not show any error. It just lists stats
and that the pg was active since 8:40 this morning.
There are row(s) with "blocked by" but no value, is that supposed to be
filled with data?

Kind regards,
Kevin



2018-05-17 16:45 GMT+02:00 Paul Emmerich <[email protected]>:

> Check ceph pg query, it will (usually) tell you why something is stuck
> inactive.
>
> Also: never do min_size 1.
>
>
> Paul
>
>
> 2018-05-17 15:48 GMT+02:00 Kevin Olbrich <[email protected]>:
>
>> I was able to obtain another NVMe to get the HDDs in node1004 into the
>> cluster.
>> The number of disks (all 1TB) is now balanced between racks, still some
>> inactive PGs:
>>
>>   data:
>>     pools:   2 pools, 1536 pgs
>>     objects: 639k objects, 2554 GB
>>     usage:   5167 GB used, 14133 GB / 19300 GB avail
>>     pgs:     1.562% pgs not active
>>              1183/1309952 objects degraded (0.090%)
>>              199660/1309952 objects misplaced (15.242%)
>>              1072 active+clean
>>              405  active+remapped+backfill_wait
>>              35   active+remapped+backfilling
>>              21   activating+remapped
>>              3    activating+undersized+degraded+remapped
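
The "pgs not active" percentage above can be reproduced from the per-state counts. This is only a sketch: it assumes a PG counts as not active when none of its "+"-separated state flags is exactly "active", and that "unknown" PGs are reported on their own line. The helper name is made up for illustration.

```python
def percent_not_active(state_counts, total_pgs):
    """Share of PGs (in percent) whose state lacks the 'active' flag.

    Assumption: 'unknown' PGs are excluded because the status output
    reports them on a separate line.
    """
    inactive = sum(
        count
        for state, count in state_counts.items()
        if state != "unknown" and "active" not in state.split("+")
    )
    return 100.0 * inactive / total_pgs


# Per-state counts copied from the status output above.
latest = {
    "active+clean": 1072,
    "active+remapped+backfill_wait": 405,
    "active+remapped+backfilling": 35,
    "activating+remapped": 21,
    "activating+undersized+degraded+remapped": 3,
}
print(f"{percent_not_active(latest, 1536):.4f}% pgs not active")
```

With 21 + 3 = 24 of 1536 PGs still activating, this works out to 24/1536, in line with the 1.562% reported.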
>>
>>
>>
>> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
>>  -1       18.85289 root default
>> -16       18.85289     datacenter dc01
>> -19       18.85289         pod dc01-agg01
>> -10        8.98700             rack dc01-rack02
>>  -4        4.03899                 host node1001
>>   0   hdd  0.90999                     osd.0         up  1.00000 1.00000
>>   1   hdd  0.90999                     osd.1         up  1.00000 1.00000
>>   5   hdd  0.90999                     osd.5         up  1.00000 1.00000
>>   2   ssd  0.43700                     osd.2         up  1.00000 1.00000
>>   3   ssd  0.43700                     osd.3         up  1.00000 1.00000
>>   4   ssd  0.43700                     osd.4         up  1.00000 1.00000
>>  -7        4.94899                 host node1002
>>   9   hdd  0.90999                     osd.9         up  1.00000 1.00000
>>  10   hdd  0.90999                     osd.10        up  1.00000 1.00000
>>  11   hdd  0.90999                     osd.11        up  1.00000 1.00000
>>  12   hdd  0.90999                     osd.12        up  1.00000 1.00000
>>   6   ssd  0.43700                     osd.6         up  1.00000 1.00000
>>   7   ssd  0.43700                     osd.7         up  1.00000 1.00000
>>   8   ssd  0.43700                     osd.8         up  1.00000 1.00000
>> -11        9.86589             rack dc01-rack03
>> -22        5.38794                 host node1003
>>  17   hdd  0.90999                     osd.17        up  1.00000 1.00000
>>  18   hdd  0.90999                     osd.18        up  1.00000 1.00000
>>  24   hdd  0.90999                     osd.24        up  1.00000 1.00000
>>  26   hdd  0.90999                     osd.26        up  1.00000 1.00000
>>  13   ssd  0.43700                     osd.13        up  1.00000 1.00000
>>  14   ssd  0.43700                     osd.14        up  1.00000 1.00000
>>  15   ssd  0.43700                     osd.15        up  1.00000 1.00000
>>  16   ssd  0.43700                     osd.16        up  1.00000 1.00000
>> -25        4.47795                 host node1004
>>  23   hdd  0.90999                     osd.23        up  1.00000 1.00000
>>  25   hdd  0.90999                     osd.25        up  1.00000 1.00000
>>  27   hdd  0.90999                     osd.27        up  1.00000 1.00000
>>  19   ssd  0.43700                     osd.19        up  1.00000 1.00000
>>  20   ssd  0.43700                     osd.20        up  1.00000 1.00000
>>  21   ssd  0.43700                     osd.21        up  1.00000 1.00000
>>  22   ssd  0.43700                     osd.22        up  1.00000 1.00000
>>
>>
>> Pools are size 2, min_size 1 during setup.
>>
>> The number of PGs in the activating state seems related to the weight of the
>> OSDs, but why are they failing to proceed to active+clean or active+remapped?
>>
>> Kind regards,
>> Kevin
>>
>> 2018-05-17 14:05 GMT+02:00 Kevin Olbrich <[email protected]>:
>>
>>> Ok, I just waited some time but I still got some "activating" issues:
>>>
>>>   data:
>>>     pools:   2 pools, 1536 pgs
>>>     objects: 639k objects, 2554 GB
>>>     usage:   5194 GB used, 11312 GB / 16506 GB avail
>>>     pgs:     7.943% pgs not active
>>>              5567/1309948 objects degraded (0.425%)
>>>              195386/1309948 objects misplaced (14.916%)
>>>              1147 active+clean
>>>              235  active+remapped+backfill_wait
>>> *             107  activating+remapped*
>>>              32   active+remapped+backfilling
>>> *             15   activating+undersized+degraded+remapped*
>>>
>>> I set these settings during runtime:
>>> ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
>>> ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
>>> ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800'
>>> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
>>>
>>> Sure, mon_max_pg_per_osd is oversized but this is just temporary.
>>> Calculated PGs per OSD is 200.
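
As a rough sanity check of these values: to my understanding, a Luminous OSD stops accepting new PGs once it holds more than mon_max_pg_per_osd multiplied by osd_max_pg_per_osd_hard_ratio. The sketch below only does that multiplication. The function name is illustrative, and the defaults of 200 and 2 are assumptions that should be checked against the documentation of the running release.

```python
def osd_pg_hard_limit(mon_max_pg_per_osd=200, hard_ratio=2.0):
    """Approximate number of PGs above which an OSD refuses new PGs.

    Assumption: the limit is the plain product of the two settings;
    the default arguments are assumed defaults, not verified ones.
    """
    return mon_max_pg_per_osd * hard_ratio


calculated_pgs_per_osd = 200  # figure quoted in the mail above

# Both injected values in effect on the OSD.
both_injected = osd_pg_hard_limit(800, 32)
# Only the ratio in effect on the OSD (mon_max_pg_per_osd was sent to mon.* only).
ratio_only = osd_pg_hard_limit(200, 32)

for label, limit in (("both injected", both_injected), ("ratio only", ratio_only)):
    headroom = limit - calculated_pgs_per_osd
    print(f"{label}: limit {limit:.0f} PGs, headroom {headroom:.0f}")
```

Either way the resulting limit is far above 200 PGs per OSD, so if PGs stay in activating after this, the threshold alone may not be the whole story.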
>>>
>>> I searched the net and the bugtracker but most posts suggest
>>> osd_max_pg_per_osd_hard_ratio = 32 to fix this issue but this time, I
>>> got more stuck PGs.
>>>
>>> Any more hints?
>>>
>>> Kind regards.
>>> Kevin
>>>
>>> 2018-05-17 13:37 GMT+02:00 Kevin Olbrich <[email protected]>:
>>>
>>>> PS: The cluster is currently size 2. I used PGCalc on the Ceph website,
>>>> which by default will place 200 PGs on each OSD.
>>>> I read about the protection in the docs and later realized that I should
>>>> have placed only 100 PGs per OSD.
>>>>
>>>>
>>>> 2018-05-17 13:35 GMT+02:00 Kevin Olbrich <[email protected]>:
>>>>
>>>>> Hi!
>>>>>
>>>>> Thanks for your quick reply.
>>>>> Before I read your mail, I applied the following setting to my OSDs:
>>>>> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
>>>>>
>>>>> Status is now:
>>>>>   data:
>>>>>     pools:   2 pools, 1536 pgs
>>>>>     objects: 639k objects, 2554 GB
>>>>>     usage:   5211 GB used, 11295 GB / 16506 GB avail
>>>>>     pgs:     7.943% pgs not active
>>>>>              5567/1309948 objects degraded (0.425%)
>>>>>              252327/1309948 objects misplaced (19.262%)
>>>>>              1030 active+clean
>>>>>              351  active+remapped+backfill_wait
>>>>>              107  activating+remapped
>>>>>              33   active+remapped+backfilling
>>>>>              15   activating+undersized+degraded+remapped
>>>>>
>>>>> A little bit better but still some non-active PGs.
>>>>> I will investigate your other hints!
>>>>>
>>>>> Thanks
>>>>> Kevin
>>>>>
>>>>> 2018-05-17 13:30 GMT+02:00 Burkhard Linke <[email protected]>:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/17/2018 01:09 PM, Kevin Olbrich wrote:
>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> Today I added some new OSDs (nearly doubled) to my luminous cluster.
>>>>>>> I then changed pg(p)_num from 256 to 1024 for that pool because it was
>>>>>>> complaining about too few PGs. (I have since realized it would have been
>>>>>>> better to do this in small steps.)
>>>>>>>
>>>>>>> This is the current status:
>>>>>>>
>>>>>>>      health: HEALTH_ERR
>>>>>>>              336568/1307562 objects misplaced (25.740%)
>>>>>>>              Reduced data availability: 128 pgs inactive, 3 pgs peering, 1 pg stale
>>>>>>>              Degraded data redundancy: 6985/1307562 objects degraded (0.534%), 19 pgs degraded, 19 pgs undersized
>>>>>>>              107 slow requests are blocked > 32 sec
>>>>>>>              218 stuck requests are blocked > 4096 sec
>>>>>>>
>>>>>>>    data:
>>>>>>>      pools:   2 pools, 1536 pgs
>>>>>>>      objects: 638k objects, 2549 GB
>>>>>>>      usage:   5210 GB used, 11295 GB / 16506 GB avail
>>>>>>>      pgs:     0.195% pgs unknown
>>>>>>>               8.138% pgs not active
>>>>>>>               6985/1307562 objects degraded (0.534%)
>>>>>>>               336568/1307562 objects misplaced (25.740%)
>>>>>>>               855 active+clean
>>>>>>>               517 active+remapped+backfill_wait
>>>>>>>               107 activating+remapped
>>>>>>>               31  active+remapped+backfilling
>>>>>>>               15  activating+undersized+degraded+remapped
>>>>>>>               4   active+undersized+degraded+remapped+backfilling
>>>>>>>               3   unknown
>>>>>>>               3   peering
>>>>>>>               1   stale+active+clean
>>>>>>>
>>>>>>
>>>>>> You need to resolve the unknown/peering/activating PGs first. You have
>>>>>> 1536 PGs; assuming replication size 3, this makes 4608 PG copies. Given
>>>>>> 25 OSDs and the heterogeneous host sizes, I assume that some OSDs hold
>>>>>> more than 200 PGs. There's a threshold for the number of PGs; reaching
>>>>>> this threshold keeps the OSDs from accepting new PGs.
>>>>>>
>>>>>> Try to increase the threshold (mon_max_pg_per_osd /
>>>>>> max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure about
>>>>>> the exact one, consult the documentation) to allow more PGs on the OSDs.
>>>>>> If this is the cause of the problem, the peering and activating states
>>>>>> should be resolved within a short time.
>>>>>>
>>>>>> You can also check the number of PGs per OSD with 'ceph osd df'; the
>>>>>> last column is the current number of PGs.
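
The arithmetic above can be written down as a small sketch. It is a back-of-the-envelope estimate only: it assumes every pool spreads over all OSDs in proportion to CRUSH weight and ignores the rack rule and device classes, so the real numbers from 'ceph osd df' take precedence. The function names are made up for illustration; the weights are taken from the OSD tree in this thread.

```python
def pg_copies(pg_count, replica_size):
    """Total number of PG copies the cluster has to place."""
    return pg_count * replica_size


def average_copies_per_osd(pg_count, replica_size, osd_count):
    """Plain average, as if every OSD had the same weight."""
    return pg_copies(pg_count, replica_size) / osd_count


def expected_copies_by_weight(copies, osd_weight, total_weight):
    """Rough share of copies for one OSD, proportional to its weight.

    Assumption: placement is purely weight-proportional across the
    whole cluster (no rack constraint, no device-class split).
    """
    return copies * osd_weight / total_weight


copies = pg_copies(1536, 3)                      # size 3 as assumed above
average = average_copies_per_osd(1536, 3, 25)    # 25 OSDs
hdd_share = expected_copies_by_weight(copies, 0.90999, 16.12177)
ssd_share = expected_copies_by_weight(copies, 0.43700, 16.12177)

print(f"copies={copies} average={average:.1f} "
      f"hdd~{hdd_share:.0f} ssd~{ssd_share:.0f}")
```

Under these assumptions the average stays below 200, but the larger HDD OSDs would end up above it, which is the kind of imbalance the threshold reacts to.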
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> OSD tree:
>>>>>>>
>>>>>>> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
>>>>>>>  -1       16.12177 root default
>>>>>>> -16       16.12177     datacenter dc01
>>>>>>> -19       16.12177         pod dc01-agg01
>>>>>>> -10        8.98700             rack dc01-rack02
>>>>>>>  -4        4.03899                 host node1001
>>>>>>>   0   hdd  0.90999                     osd.0         up  1.00000 1.00000
>>>>>>>   1   hdd  0.90999                     osd.1         up  1.00000 1.00000
>>>>>>>   5   hdd  0.90999                     osd.5         up  1.00000 1.00000
>>>>>>>   2   ssd  0.43700                     osd.2         up  1.00000 1.00000
>>>>>>>   3   ssd  0.43700                     osd.3         up  1.00000 1.00000
>>>>>>>   4   ssd  0.43700                     osd.4         up  1.00000 1.00000
>>>>>>>  -7        4.94899                 host node1002
>>>>>>>   9   hdd  0.90999                     osd.9         up  1.00000 1.00000
>>>>>>>  10   hdd  0.90999                     osd.10        up  1.00000 1.00000
>>>>>>>  11   hdd  0.90999                     osd.11        up  1.00000 1.00000
>>>>>>>  12   hdd  0.90999                     osd.12        up  1.00000 1.00000
>>>>>>>   6   ssd  0.43700                     osd.6         up  1.00000 1.00000
>>>>>>>   7   ssd  0.43700                     osd.7         up  1.00000 1.00000
>>>>>>>   8   ssd  0.43700                     osd.8         up  1.00000 1.00000
>>>>>>> -11        7.13477             rack dc01-rack03
>>>>>>> -22        5.38678                 host node1003
>>>>>>>  17   hdd  0.90970                     osd.17        up  1.00000 1.00000
>>>>>>>  18   hdd  0.90970                     osd.18        up  1.00000 1.00000
>>>>>>>  24   hdd  0.90970                     osd.24        up  1.00000 1.00000
>>>>>>>  26   hdd  0.90970                     osd.26        up  1.00000 1.00000
>>>>>>>  13   ssd  0.43700                     osd.13        up  1.00000 1.00000
>>>>>>>  14   ssd  0.43700                     osd.14        up  1.00000 1.00000
>>>>>>>  15   ssd  0.43700                     osd.15        up  1.00000 1.00000
>>>>>>>  16   ssd  0.43700                     osd.16        up  1.00000 1.00000
>>>>>>> -25        1.74799                 host node1004
>>>>>>>  19   ssd  0.43700                     osd.19        up  1.00000 1.00000
>>>>>>>  20   ssd  0.43700                     osd.20        up  1.00000 1.00000
>>>>>>>  21   ssd  0.43700                     osd.21        up  1.00000 1.00000
>>>>>>>  22   ssd  0.43700                     osd.22        up  1.00000 1.00000
>>>>>>>
>>>>>>>
>>>>>>> Crush rule is set to chooseleaf rack and (temporary!) to size 2.
>>>>>>> Why are PGs stuck in peering and activating?
>>>>>>> "ceph df" shows that only 1.5 TB are used on the pool, residing on the
>>>>>>> HDDs - which would perfectly fit the crush rule....(?)
>>>>>>>
>>>>>>
>>>>>> Size 2 within the crush rule or size 2 for the two pools?
>>>>>>
>>>>>> Regards,
>>>>>> Burkhard
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list
>>>>>> [email protected]
>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>

