Ok, I just waited some time, but I still have some "activating" issues:
  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5194 GB used, 11312 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             195386/1309948 objects misplaced (14.916%)
             1147 active+clean
             235  active+remapped+backfill_wait
             *107 activating+remapped*
             32   active+remapped+backfilling
             *15  activating+undersized+degraded+remapped*

I set these settings during runtime:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800'
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'

Sure, mon_max_pg_per_osd is oversized, but that is only temporary; the calculated PG count per OSD is 200.
I searched the net and the bug tracker, and most posts suggest osd_max_pg_per_osd_hard_ratio = 32 to fix this issue, but this time I got even more stuck PGs.

Any more hints?

Kind regards
Kevin
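PS: A quick way to confirm that the injected values actually reached the daemons, and to see how many PGs each OSD currently holds, is something along these lines (osd.0 is only an example ID; the 'ceph daemon' calls have to be run on the host that carries that OSD's admin socket):

ceph daemon osd.0 config get osd_max_backfills
ceph daemon osd.0 config get osd_recovery_max_active
ceph daemon osd.0 config get osd_max_pg_per_osd_hard_ratio
ceph osd df    # the last column (PGS) is the current number of PGs per OSD

As far as I understand the limit, an OSD refuses to accept new PGs once it holds more than mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio of them, so the PGS column is the number to compare against.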
2018-05-17 13:37 GMT+02:00 Kevin Olbrich <k...@sv01.de>:

> PS: Cluster currently is size 2. I used PGCalc on the Ceph website, which by
> default will place 200 PGs on each OSD.
> I read about the protection in the docs and later realized that I should have
> placed only 100 PGs.
>
> 2018-05-17 13:35 GMT+02:00 Kevin Olbrich <k...@sv01.de>:
>
>> Hi!
>>
>> Thanks for your quick reply.
>> Before I read your mail, I applied the following conf to my OSDs:
>> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
>>
>> Status is now:
>>   data:
>>     pools:   2 pools, 1536 pgs
>>     objects: 639k objects, 2554 GB
>>     usage:   5211 GB used, 11295 GB / 16506 GB avail
>>     pgs:     7.943% pgs not active
>>              5567/1309948 objects degraded (0.425%)
>>              252327/1309948 objects misplaced (19.262%)
>>              1030 active+clean
>>              351  active+remapped+backfill_wait
>>              107  activating+remapped
>>              33   active+remapped+backfilling
>>              15   activating+undersized+degraded+remapped
>>
>> A little bit better, but still some non-active PGs.
>> I will investigate your other hints!
>>
>> Thanks
>> Kevin
>>
>> 2018-05-17 13:30 GMT+02:00 Burkhard Linke <Burkhard.Linke@computational.bio.uni-giessen.de>:
>>
>>> Hi,
>>>
>>> On 05/17/2018 01:09 PM, Kevin Olbrich wrote:
>>>
>>>> Hi!
>>>>
>>>> Today I added some new OSDs (nearly doubled) to my luminous cluster.
>>>> I then changed pg(p)_num from 256 to 1024 for that pool because it was
>>>> complaining about too few PGs. (I noticed that this should better have
>>>> been done in small steps.)
>>>>
>>>> This is the current status:
>>>>
>>>>   health: HEALTH_ERR
>>>>           336568/1307562 objects misplaced (25.740%)
>>>>           Reduced data availability: 128 pgs inactive, 3 pgs peering, 1 pg stale
>>>>           Degraded data redundancy: 6985/1307562 objects degraded (0.534%), 19 pgs degraded, 19 pgs undersized
>>>>           107 slow requests are blocked > 32 sec
>>>>           218 stuck requests are blocked > 4096 sec
>>>>
>>>>   data:
>>>>     pools:   2 pools, 1536 pgs
>>>>     objects: 638k objects, 2549 GB
>>>>     usage:   5210 GB used, 11295 GB / 16506 GB avail
>>>>     pgs:     0.195% pgs unknown
>>>>              8.138% pgs not active
>>>>              6985/1307562 objects degraded (0.534%)
>>>>              336568/1307562 objects misplaced (25.740%)
>>>>              855 active+clean
>>>>              517 active+remapped+backfill_wait
>>>>              107 activating+remapped
>>>>              31  active+remapped+backfilling
>>>>              15  activating+undersized+degraded+remapped
>>>>              4   active+undersized+degraded+remapped+backfilling
>>>>              3   unknown
>>>>              3   peering
>>>>              1   stale+active+clean
>>>
>>> You need to resolve the unknown/peering/activating PGs first.
>>> You have 1536 PGs; assuming replication size 3, that makes 4608 PG copies.
>>> Given 25 OSDs and the heterogeneous host sizes, I assume that some OSDs
>>> hold more than 200 PGs. There's a threshold for the number of PGs;
>>> reaching this threshold keeps the OSDs from accepting new PGs.
>>>
>>> Try to increase the threshold (mon_max_pg_per_osd / max_pg_per_osd_hard_ratio /
>>> osd_max_pg_per_osd_hard_ratio, I'm not sure about the exact one, consult the
>>> documentation) to allow more PGs on the OSDs. If this is the cause of the
>>> problem, the peering and activating states should be resolved within a short time.
>>>
>>> You can also check the number of PGs per OSD with 'ceph osd df'; the last
>>> column is the current number of PGs.
>>>
>>>> OSD tree:
>>>>
>>>> ID  CLASS WEIGHT   TYPE NAME                       STATUS REWEIGHT PRI-AFF
>>>>  -1       16.12177 root default
>>>> -16       16.12177     datacenter dc01
>>>> -19       16.12177         pod dc01-agg01
>>>> -10        8.98700             rack dc01-rack02
>>>>  -4        4.03899                 host node1001
>>>>   0   hdd  0.90999                     osd.0           up  1.00000 1.00000
>>>>   1   hdd  0.90999                     osd.1           up  1.00000 1.00000
>>>>   5   hdd  0.90999                     osd.5           up  1.00000 1.00000
>>>>   2   ssd  0.43700                     osd.2           up  1.00000 1.00000
>>>>   3   ssd  0.43700                     osd.3           up  1.00000 1.00000
>>>>   4   ssd  0.43700                     osd.4           up  1.00000 1.00000
>>>>  -7        4.94899                 host node1002
>>>>   9   hdd  0.90999                     osd.9           up  1.00000 1.00000
>>>>  10   hdd  0.90999                     osd.10          up  1.00000 1.00000
>>>>  11   hdd  0.90999                     osd.11          up  1.00000 1.00000
>>>>  12   hdd  0.90999                     osd.12          up  1.00000 1.00000
>>>>   6   ssd  0.43700                     osd.6           up  1.00000 1.00000
>>>>   7   ssd  0.43700                     osd.7           up  1.00000 1.00000
>>>>   8   ssd  0.43700                     osd.8           up  1.00000 1.00000
>>>> -11        7.13477             rack dc01-rack03
>>>> -22        5.38678                 host node1003
>>>>  17   hdd  0.90970                     osd.17          up  1.00000 1.00000
>>>>  18   hdd  0.90970                     osd.18          up  1.00000 1.00000
>>>>  24   hdd  0.90970                     osd.24          up  1.00000 1.00000
>>>>  26   hdd  0.90970                     osd.26          up  1.00000 1.00000
>>>>  13   ssd  0.43700                     osd.13          up  1.00000 1.00000
>>>>  14   ssd  0.43700                     osd.14          up  1.00000 1.00000
>>>>  15   ssd  0.43700                     osd.15          up  1.00000 1.00000
>>>>  16   ssd  0.43700                     osd.16          up  1.00000 1.00000
>>>> -25        1.74799                 host node1004
>>>>  19   ssd  0.43700                     osd.19          up  1.00000 1.00000
>>>>  20   ssd  0.43700                     osd.20          up  1.00000 1.00000
>>>>  21   ssd  0.43700                     osd.21          up  1.00000 1.00000
>>>>  22   ssd  0.43700                     osd.22          up  1.00000 1.00000
>>>>
>>>> Crush rule is set to chooseleaf rack and (temporarily!) to size 2.
>>>> Why are PGs stuck in peering and activating?
>>>> "ceph df" shows that only 1.5 TB are used on the pool, residing on the hdd's
>>>> - which would perfectly fit the crush rule....(?)
>>>
>>> Size 2 within the crush rule or size 2 for the two pools?
>>>
>>> Regards,
>>> Burkhard
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
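Regarding Burkhard's question at the end of the quoted thread (size 2 set on the pools vs. only assumed in the crush rule): both can be checked directly. Something like the following should show the per-pool size as well as the rule each pool uses; the exact output format differs a bit between releases:

ceph osd pool ls detail    # prints size, min_size and crush_rule for every pool
ceph osd crush rule dump   # dumps each rule, including its choose/chooseleaf steps and failure domain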