Hi! Thanks for your quick reply.

Before I read your mail, I applied the following config change to my OSDs:

  ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
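As a side note, values injected with 'ceph tell ... injectargs' apply only to the running daemons and do not survive a restart. A sketch of persisting the same setting, assuming a plain ceph.conf setup as used on Luminous:

  # /etc/ceph/ceph.conf on each OSD host
  [osd]
  osd max pg per osd hard ratio = 32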
Status is now:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5211 GB used, 11295 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             252327/1309948 objects misplaced (19.262%)
             1030 active+clean
              351 active+remapped+backfill_wait
              107 activating+remapped
               33 active+remapped+backfilling
               15 activating+undersized+degraded+remapped

A little bit better, but there are still some non-active PGs. I will
investigate your other hints!

Thanks
Kevin

2018-05-17 13:30 GMT+02:00 Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de>:

> Hi,
>
> On 05/17/2018 01:09 PM, Kevin Olbrich wrote:
>
>> Hi!
>>
>> Today I added some new OSDs (nearly doubled) to my Luminous cluster.
>> I then changed pg(p)_num from 256 to 1024 for that pool because it was
>> complaining about too few PGs. (I have since noticed that this should
>> have been done in smaller steps.)
>>
>> This is the current status:
>>
>>   health: HEALTH_ERR
>>           336568/1307562 objects misplaced (25.740%)
>>           Reduced data availability: 128 pgs inactive, 3 pgs peering,
>>           1 pg stale
>>           Degraded data redundancy: 6985/1307562 objects degraded
>>           (0.534%), 19 pgs degraded, 19 pgs undersized
>>           107 slow requests are blocked > 32 sec
>>           218 stuck requests are blocked > 4096 sec
>>
>>   data:
>>     pools:   2 pools, 1536 pgs
>>     objects: 638k objects, 2549 GB
>>     usage:   5210 GB used, 11295 GB / 16506 GB avail
>>     pgs:     0.195% pgs unknown
>>              8.138% pgs not active
>>              6985/1307562 objects degraded (0.534%)
>>              336568/1307562 objects misplaced (25.740%)
>>              855 active+clean
>>              517 active+remapped+backfill_wait
>>              107 activating+remapped
>>               31 active+remapped+backfilling
>>               15 activating+undersized+degraded+remapped
>>                4 active+undersized+degraded+remapped+backfilling
>>                3 unknown
>>                3 peering
>>                1 stale+active+clean
>
> You need to resolve the unknown/peering/activating PGs first. You have
> 1536 PGs; assuming replication size 3, this makes 4608 PG copies. Given
> 25 OSDs and the heterogeneous host sizes, I assume that some OSDs hold
> more than 200 PGs. There's a threshold for the number of PGs per OSD;
> reaching this threshold keeps the OSDs from accepting new PGs.
>
> Try to increase the threshold (mon_max_pg_per_osd /
> max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure
> about the exact one, consult the documentation) to allow more PGs on
> the OSDs. If this is the cause of the problem, the peering and
> activating states should be resolved within a short time.
>
> You can also check the number of PGs per OSD with 'ceph osd df'; the
> last column is the current number of PGs.
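For reference, a minimal sketch of those checks on a Luminous cluster (osd.0 is just an example id; the option names are the ones from Luminous):

  # PGs per OSD: the PGS column at the far right of the output
  ceph osd df

  # Thresholds as seen by a running OSD, queried via its admin socket
  ceph daemon osd.0 config show | grep -E 'mon_max_pg_per_osd|osd_max_pg_per_osd_hard_ratio'

Rough arithmetic for this cluster: 1536 PGs x size 3 = 4608 PG copies spread over 25 OSDs, i.e. about 184 PGs per OSD on average, so with the Luminous default of mon_max_pg_per_osd = 200 an uneven distribution can easily push individual OSDs over the limit.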
>> OSD tree:
>>
>> ID  CLASS WEIGHT   TYPE NAME                       STATUS REWEIGHT PRI-AFF
>>  -1       16.12177 root default
>> -16       16.12177     datacenter dc01
>> -19       16.12177         pod dc01-agg01
>> -10        8.98700             rack dc01-rack02
>>  -4        4.03899                 host node1001
>>   0   hdd  0.90999                     osd.0           up  1.00000 1.00000
>>   1   hdd  0.90999                     osd.1           up  1.00000 1.00000
>>   5   hdd  0.90999                     osd.5           up  1.00000 1.00000
>>   2   ssd  0.43700                     osd.2           up  1.00000 1.00000
>>   3   ssd  0.43700                     osd.3           up  1.00000 1.00000
>>   4   ssd  0.43700                     osd.4           up  1.00000 1.00000
>>  -7        4.94899                 host node1002
>>   9   hdd  0.90999                     osd.9           up  1.00000 1.00000
>>  10   hdd  0.90999                     osd.10          up  1.00000 1.00000
>>  11   hdd  0.90999                     osd.11          up  1.00000 1.00000
>>  12   hdd  0.90999                     osd.12          up  1.00000 1.00000
>>   6   ssd  0.43700                     osd.6           up  1.00000 1.00000
>>   7   ssd  0.43700                     osd.7           up  1.00000 1.00000
>>   8   ssd  0.43700                     osd.8           up  1.00000 1.00000
>> -11        7.13477             rack dc01-rack03
>> -22        5.38678                 host node1003
>>  17   hdd  0.90970                     osd.17          up  1.00000 1.00000
>>  18   hdd  0.90970                     osd.18          up  1.00000 1.00000
>>  24   hdd  0.90970                     osd.24          up  1.00000 1.00000
>>  26   hdd  0.90970                     osd.26          up  1.00000 1.00000
>>  13   ssd  0.43700                     osd.13          up  1.00000 1.00000
>>  14   ssd  0.43700                     osd.14          up  1.00000 1.00000
>>  15   ssd  0.43700                     osd.15          up  1.00000 1.00000
>>  16   ssd  0.43700                     osd.16          up  1.00000 1.00000
>> -25        1.74799                 host node1004
>>  19   ssd  0.43700                     osd.19          up  1.00000 1.00000
>>  20   ssd  0.43700                     osd.20          up  1.00000 1.00000
>>  21   ssd  0.43700                     osd.21          up  1.00000 1.00000
>>  22   ssd  0.43700                     osd.22          up  1.00000 1.00000
>>
>> The crush rule is set to chooseleaf rack and (temporarily!) to size 2.
>> Why are PGs stuck in peering and activating?
>> "ceph df" shows that only 1.5 TB are used on the pool, residing on the
>> HDDs - which would perfectly fit the crush rule... (?)
>
> Size 2 within the crush rule or size 2 for the two pools?
>
> Regards,
> Burkhard
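A quick way to answer that question from the cluster itself, as a sketch (command names as in Luminous):

  # Per-pool settings: 'size 2' and the crush_rule id appear on each pool's line
  ceph osd pool ls detail

  # Resolve the rule id to its definition; the chooseleaf step shows the failure domain
  ceph osd crush rule dump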