Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi!

@Paul: Thanks! I know, I read the whole thread about size 2 some months ago, but this was not my decision; I had to set it up like that.

In the meantime I rebooted node1001 and node1002 with the "noout" flag set, and now peering has finished, only 0.0x% of objects are being rebalanced, and IO is flowing again. This happened as soon as the OSDs were down (but not out). This looks very much like a bug to me, doesn't it? Restarting an OSD to "repair" CRUSH?

I also queried the PG, but it did not show any error. It just lists stats and that the PG has been active since 8:40 this morning. There are rows with "blocked by" but no value; is that supposed to be filled with data?

Kind regards,
Kevin

2018-05-17 16:45 GMT+02:00 Paul Emmerich:
> Check ceph pg query, it will (usually) tell you why something is stuck
> inactive.
>
> Also: never do min_size 1.
>
> Paul
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Check ceph pg query, it will (usually) tell you why something is stuck inactive.

Also: never do min_size 1.

Paul

2018-05-17 15:48 GMT+02:00 Kevin Olbrich:
> I was able to obtain another NVMe to get the HDDs in node1004 into the
> cluster. The number of disks (all 1TB) is now balanced between racks,
> still some inactive PGs.
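Paul's suggestion can be scripted: pull the stuck PG ids out of 'ceph pg dump_stuck' style output and feed each one to 'ceph pg <pgid> query'. A minimal sketch, where the sample text stands in for live cluster output and the PG ids are invented:

```shell
# Pull stuck PG ids out of 'ceph pg dump_stuck' style output so each one
# can be inspected with 'ceph pg <pgid> query'. The sample text below
# stands in for live cluster output; the PG ids are invented.
dump_stuck='PG_STAT STATE
1.2a    activating+remapped
1.4f    activating+undersized+degraded+remapped
2.10    active+clean'
# Skip the header line and keep PGs whose state does not start with "active+"
pgids=$(printf '%s\n' "$dump_stuck" | awk 'NR > 1 && $2 !~ /^active\+/ {print $1}')
printf '%s\n' "$pgids"
# On a live cluster, each id would then be checked with e.g.:
#   ceph pg 1.2a query    # look at the recovery_state section
```

On a real cluster the here-string would be replaced by the actual command output, e.g. `pgids=$(ceph pg dump_stuck inactive 2>/dev/null | awk ...)`.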
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
I was able to obtain another NVMe to get the HDDs in node1004 into the cluster. The number of disks (all 1TB) is now balanced between racks, still some inactive PGs:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5167 GB used, 14133 GB / 19300 GB avail
    pgs:     1.562% pgs not active
             1183/1309952 objects degraded (0.090%)
             199660/1309952 objects misplaced (15.242%)
             1072 active+clean
              405 active+remapped+backfill_wait
               35 active+remapped+backfilling
               21 activating+remapped
                3 activating+undersized+degraded+remapped

ID  CLASS WEIGHT   TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       18.85289 root default
-16       18.85289     datacenter dc01
-19       18.85289         pod dc01-agg01
-10        8.98700             rack dc01-rack02
 -4        4.03899                 host node1001
  0   hdd  0.90999                     osd.0            up  1.00000 1.00000
  1   hdd  0.90999                     osd.1            up  1.00000 1.00000
  5   hdd  0.90999                     osd.5            up  1.00000 1.00000
  2   ssd  0.43700                     osd.2            up  1.00000 1.00000
  3   ssd  0.43700                     osd.3            up  1.00000 1.00000
  4   ssd  0.43700                     osd.4            up  1.00000 1.00000
 -7        4.94899                 host node1002
  9   hdd  0.90999                     osd.9            up  1.00000 1.00000
 10   hdd  0.90999                     osd.10           up  1.00000 1.00000
 11   hdd  0.90999                     osd.11           up  1.00000 1.00000
 12   hdd  0.90999                     osd.12           up  1.00000 1.00000
  6   ssd  0.43700                     osd.6            up  1.00000 1.00000
  7   ssd  0.43700                     osd.7            up  1.00000 1.00000
  8   ssd  0.43700                     osd.8            up  1.00000 1.00000
-11        9.86589             rack dc01-rack03
-22        5.38794                 host node1003
 17   hdd  0.90999                     osd.17           up  1.00000 1.00000
 18   hdd  0.90999                     osd.18           up  1.00000 1.00000
 24   hdd  0.90999                     osd.24           up  1.00000 1.00000
 26   hdd  0.90999                     osd.26           up  1.00000 1.00000
 13   ssd  0.43700                     osd.13           up  1.00000 1.00000
 14   ssd  0.43700                     osd.14           up  1.00000 1.00000
 15   ssd  0.43700                     osd.15           up  1.00000 1.00000
 16   ssd  0.43700                     osd.16           up  1.00000 1.00000
-25        4.47795                 host node1004
 23   hdd  0.90999                     osd.23           up  1.00000 1.00000
 25   hdd  0.90999                     osd.25           up  1.00000 1.00000
 27   hdd  0.90999                     osd.27           up  1.00000 1.00000
 19   ssd  0.43700                     osd.19           up  1.00000 1.00000
 20   ssd  0.43700                     osd.20           up  1.00000 1.00000
 21   ssd  0.43700                     osd.21           up  1.00000 1.00000
 22   ssd  0.43700                     osd.22           up  1.00000 1.00000

Pools are size 2, min_size 1 during setup.

The count of PGs in the activating state is related to the weight of the OSDs, but why do they fail to proceed to active+clean or active+remapped?

Kind regards,
Kevin

2018-05-17 14:05 GMT+02:00 Kevin Olbrich:
> Ok, I just waited some time but I still got some "activating" issues.
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Ok, I just waited some time but I still got some "activating" issues:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5194 GB used, 11312 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             195386/1309948 objects misplaced (14.916%)
             1147 active+clean
              235 active+remapped+backfill_wait
              107 activating+remapped
               32 active+remapped+backfilling
               15 activating+undersized+degraded+remapped

I set these settings at runtime:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800'
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'

Sure, mon_max_pg_per_osd is oversized, but this is just temporary. The calculated PG count per OSD is 200.

I searched the net and the bug tracker; most posts suggest osd_max_pg_per_osd_hard_ratio = 32 to fix this issue, but this time I got more stuck PGs.

Any more hints?

Kind regards,
Kevin

2018-05-17 13:37 GMT+02:00 Kevin Olbrich:
> PS: Cluster currently is size 2. I used PGCalc on the Ceph website which,
> by default, will place 200 PGs on each OSD.
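One thing to keep in mind with the injectargs calls above: they change values only at runtime and revert when a daemon restarts. If the tuning should survive restarts, the same options would also go into ceph.conf, roughly like this (a sketch for a luminous-era cluster; the section placement is an assumption, check the documentation for your release):

```ini
# Hypothetical ceph.conf fragment mirroring the injectargs calls above;
# injected values revert on daemon restart unless persisted like this.
[osd]
osd max backfills = 16
osd recovery max active = 4
osd max pg per osd hard ratio = 32

[mon]
mon max pg per osd = 800
```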
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
PS: Cluster currently is size 2. I used PGCalc on the Ceph website which, by default, will place 200 PGs on each OSD. I read about the protection in the docs and later noticed that I had better placed only 100 PGs.

2018-05-17 13:35 GMT+02:00 Kevin Olbrich:
> Hi!
>
> Thanks for your quick reply.
> Before I read your mail, I applied the following conf to my OSDs:
> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi!

Thanks for your quick reply.
Before I read your mail, I applied the following conf to my OSDs:
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'

Status is now:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5211 GB used, 11295 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             252327/1309948 objects misplaced (19.262%)
             1030 active+clean
              351 active+remapped+backfill_wait
              107 activating+remapped
               33 active+remapped+backfilling
               15 activating+undersized+degraded+remapped

A little bit better, but still some non-active PGs. I will investigate your other hints!

Thanks,
Kevin

2018-05-17 13:30 GMT+02:00 Burkhard Linke <burkhard.li...@computational.bio.uni-giessen.de>:
> Hi,
>
> On 05/17/2018 01:09 PM, Kevin Olbrich wrote:
>> Today I added some new OSDs (nearly doubled) to my luminous cluster.
>> I then changed pg(p)_num from 256 to 1024 for that pool because it was
>> complaining about too few PGs.
>
> You need to resolve the unknown/peering/activating pgs first. You have
> 1536 PGs; assuming replication size 3, this makes 4608 PG copies. Given
> 25 OSDs and the heterogeneous host sizes, I assume that some OSDs hold
> more than 200 PGs. There's a threshold for the number of PGs; reaching
> this threshold keeps the OSDs from accepting new PGs.
>
> Try to increase the threshold (mon_max_pg_per_osd /
> max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure
> about the exact one, consult the documentation) to allow more PGs on
> the OSDs. If this is the cause of the problem, the peering and
> activating states should be resolved within a short time.
>
> You can also check the number of PGs per OSD with 'ceph osd df'; the
> last column is the current number of PGs.
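Burkhard's 'ceph osd df' check can be automated: the last column is the PG count per OSD, so a one-liner can flag OSDs over the limit. A sketch using sample output (the weights and PG counts below are made up, not from this cluster):

```shell
# Flag OSDs whose PG count exceeds a threshold, from 'ceph osd df' style
# output (last column = current number of PGs). Sample data, not live output.
osd_df='ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
0  hdd   0.90999 1.00000 931G  250G  681G  26.9  1.01 214
2  ssd   0.43700 1.00000 447G  120G  327G  26.8  1.00 187'
limit=200   # e.g. the default mon_max_pg_per_osd
over=$(printf '%s\n' "$osd_df" | awk -v lim="$limit" \
    'NR > 1 && $NF+0 > lim+0 {print "osd." $1 " holds " $NF " PGs (limit " lim ")"}')
printf '%s\n' "$over"
# On a live cluster: ceph osd df | awk ... with the same filter.
```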
[ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi!

Today I added some new OSDs (nearly doubled) to my luminous cluster. I then changed pg(p)_num from 256 to 1024 for that pool because it was complaining about too few PGs. (I have since noticed that this should better have been done in small steps.)

This is the current status:

  health: HEALTH_ERR
          336568/1307562 objects misplaced (25.740%)
          Reduced data availability: 128 pgs inactive, 3 pgs peering, 1 pg stale
          Degraded data redundancy: 6985/1307562 objects degraded (0.534%), 19 pgs degraded, 19 pgs undersized
          107 slow requests are blocked > 32 sec
          218 stuck requests are blocked > 4096 sec

  data:
    pools:   2 pools, 1536 pgs
    objects: 638k objects, 2549 GB
    usage:   5210 GB used, 11295 GB / 16506 GB avail
    pgs:     0.195% pgs unknown
             8.138% pgs not active
             6985/1307562 objects degraded (0.534%)
             336568/1307562 objects misplaced (25.740%)
             855 active+clean
             517 active+remapped+backfill_wait
             107 activating+remapped
              31 active+remapped+backfilling
              15 activating+undersized+degraded+remapped
               4 active+undersized+degraded+remapped+backfilling
               3 unknown
               3 peering
               1 stale+active+clean

OSD tree:

ID  CLASS WEIGHT   TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       16.12177 root default
-16       16.12177     datacenter dc01
-19       16.12177         pod dc01-agg01
-10        8.98700             rack dc01-rack02
 -4        4.03899                 host node1001
  0   hdd  0.90999                     osd.0            up  1.00000 1.00000
  1   hdd  0.90999                     osd.1            up  1.00000 1.00000
  5   hdd  0.90999                     osd.5            up  1.00000 1.00000
  2   ssd  0.43700                     osd.2            up  1.00000 1.00000
  3   ssd  0.43700                     osd.3            up  1.00000 1.00000
  4   ssd  0.43700                     osd.4            up  1.00000 1.00000
 -7        4.94899                 host node1002
  9   hdd  0.90999                     osd.9            up  1.00000 1.00000
 10   hdd  0.90999                     osd.10           up  1.00000 1.00000
 11   hdd  0.90999                     osd.11           up  1.00000 1.00000
 12   hdd  0.90999                     osd.12           up  1.00000 1.00000
  6   ssd  0.43700                     osd.6            up  1.00000 1.00000
  7   ssd  0.43700                     osd.7            up  1.00000 1.00000
  8   ssd  0.43700                     osd.8            up  1.00000 1.00000
-11        7.13477             rack dc01-rack03
-22        5.38678                 host node1003
 17   hdd  0.90970                     osd.17           up  1.00000 1.00000
 18   hdd  0.90970                     osd.18           up  1.00000 1.00000
 24   hdd  0.90970                     osd.24           up  1.00000 1.00000
 26   hdd  0.90970                     osd.26           up  1.00000 1.00000
 13   ssd  0.43700                     osd.13           up  1.00000 1.00000
 14   ssd  0.43700                     osd.14           up  1.00000 1.00000
 15   ssd  0.43700                     osd.15           up  1.00000 1.00000
 16   ssd  0.43700                     osd.16           up  1.00000 1.00000
-25        1.74799                 host node1004
 19   ssd  0.43700                     osd.19           up  1.00000 1.00000
 20   ssd  0.43700                     osd.20           up  1.00000 1.00000
 21   ssd  0.43700                     osd.21           up  1.00000 1.00000
 22   ssd  0.43700                     osd.22           up  1.00000 1.00000

The crush rule is set to chooseleaf rack, and size is (temporarily!) 2.

Why are PGs stuck in peering and activating? "ceph df" shows that only 1.5 TB are used on the pool, residing on the HDDs, which would perfectly fit the crush rule(?)

Is this only a problem during recovery, with the cluster moving to OK after rebalancing, or can I take any action to unblock IO on the HDD pool? This is a pre-prod cluster; it does not have the highest priority, but I would appreciate it if we could use it before rebalancing is completed.

Kind regards,
Kevin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
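As noted above, the 256 to 1024 jump should better have been done in small steps. A sketch of such a stepped increase follows; the pool name and step size are assumptions, and this version only prints the commands it would run:

```shell
# Sketch: raise pg_num in small increments instead of one 256 -> 1024 jump,
# waiting for the cluster to settle between steps. Pool name is made up.
pool=hdd-pool          # assumption: your pool's name
cur=256 target=1024 step=128 steps=""
while [ "$cur" -lt "$target" ]; do
  next=$((cur + step))
  if [ "$next" -gt "$target" ]; then next=$target; fi
  echo "ceph osd pool set $pool pg_num $next"
  # In practice: run that command, wait until 'ceph status' shows the new
  # PGs active+clean, then raise pgp_num to the same value before the
  # next step.
  steps="$steps $next"
  cur=$next
done
```

Printing instead of executing makes the plan reviewable first; dropping the `echo` would apply it for real.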
Re: [ceph-users] Blocked Requests
Hi all,

So, using ceph-ansible, I built the below-mentioned cluster with 2 OSD nodes and 3 mons. Just after creating the OSDs I started benchmarking performance using "rbd bench" and "rados bench" and saw the performance drop. Checking the status shows slow requests.

[root@storage-28-1 ~]# ceph -s
  cluster:
    id:     009cbed0-e5a8-4b18-a313-098e55742e85
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1264 slow requests are blocked > 32 sec

  services:
    mon:         3 daemons, quorum storage-30,storage-29,storage-28-1
    mgr:         storage-30(active), standbys: storage-28-1, storage-29
    mds:         cephfs-3/3/3 up {0=storage-30=up:active,1=storage-28-1=up:active,2=storage-29=up:active}
    osd:         33 osds: 33 up, 33 in
    tcmu-runner: 2 daemons active

  data:
    pools:   3 pools, 1536 pgs
    objects: 13289 objects, 42881 MB
    usage:   102 GB used, 55229 GB / 55331 GB avail
    pgs:     1536 active+clean

  io:
    client: 1694 B/s rd, 1 op/s rd, 0 op/s wr

[root@storage-28-1 ~]# ceph health detail
HEALTH_WARN insufficient standby MDS daemons available; 904 slow requests are blocked > 32 sec
MDS_INSUFFICIENT_STANDBY insufficient standby MDS daemons available
    have 0; want 1 more
REQUEST_SLOW 904 slow requests are blocked > 32 sec
    364 ops are blocked > 1048.58 sec
    212 ops are blocked > 524.288 sec
    164 ops are blocked > 262.144 sec
    100 ops are blocked > 131.072 sec
    64 ops are blocked > 65.536 sec
    osd.11 has blocked requests > 524.288 sec
    osds 9,32 have blocked requests > 1048.58 sec

osd 9 log: https://pastebin.com/ex41cFww

I see that from time to time different OSDs report blocked requests. I am not sure what the cause could be. Can anyone help me fix this, please?

[root@storage-28-1 ~]# ceph osd tree
ID CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       54.03387 root default
-3       27.83563     host storage-29
 2   hdd  1.63739         osd.2           up  1.00000 1.00000
 3   hdd  1.63739         osd.3           up  1.00000 1.00000
 4   hdd  1.63739         osd.4           up  1.00000 1.00000
 5   hdd  1.63739         osd.5           up  1.00000 1.00000
 6   hdd  1.63739         osd.6           up  1.00000 1.00000
 7   hdd  1.63739         osd.7           up  1.00000 1.00000
 8   hdd  1.63739         osd.8           up  1.00000 1.00000
 9   hdd  1.63739         osd.9           up  1.00000 1.00000
10   hdd  1.63739         osd.10          up  1.00000 1.00000
11   hdd  1.63739         osd.11          up  1.00000 1.00000
12   hdd  1.63739         osd.12          up  1.00000 1.00000
13   hdd  1.63739         osd.13          up  1.00000 1.00000
14   hdd  1.63739         osd.14          up  1.00000 1.00000
15   hdd  1.63739         osd.15          up  1.00000 1.00000
16   hdd  1.63739         osd.16          up  1.00000 1.00000
17   hdd  1.63739         osd.17          up  1.00000 1.00000
18   hdd  1.63739         osd.18          up  1.00000 1.00000
-5       26.19824     host storage-30
 0   hdd  1.63739         osd.0           up  1.00000 1.00000
 1   hdd  1.63739         osd.1           up  1.00000 1.00000
19   hdd  1.63739         osd.19          up  1.00000 1.00000
20   hdd  1.63739         osd.20          up  1.00000 1.00000
21   hdd  1.63739         osd.21          up  1.00000 1.00000
22   hdd  1.63739         osd.22          up  1.00000 1.00000
23   hdd  1.63739         osd.23          up  1.00000 1.00000
24   hdd  1.63739         osd.24          up  1.00000 1.00000
25   hdd  1.63739         osd.25          up  1.00000 1.00000
26   hdd  1.63739         osd.26          up  1.00000 1.00000
27   hdd  1.63739         osd.27          up  1.00000 1.00000
28   hdd  1.63739         osd.28          up  1.00000 1.00000
29   hdd  1.63739         osd.29          up  1.00000 1.00000
30   hdd  1.63739         osd.30          up  1.00000 1.00000
31   hdd  1.63739         osd.31          up  1.00000 1.00000
32   hdd  1.63739         osd.32          up  1.00000 1.00000

thanks

On Fri, Apr 20, 2018 at 10:24 AM, Shantur Rathore wrote:
> Thanks Alfredo. I will use ceph-volume.
>
> On Thu, Apr 19, 2018 at 4:24 PM, Alfredo Deza wrote:
>> On Thu, Apr 19, 2018 at 11:10 AM, Shantur Rathore wrote:
>>> Hi,
>>>
>>> I am building my first Ceph cluster from hardware left over from a
>>> previous project. I have been reading a lot of Ceph documentation but
>>> need some help to make sure I am going the right way.
>>>
>>> To set the stage, below is what I have:
>>>
>>> Rack-1
>>>
>>> 1 x HP DL360 G9 with
>>>   - 256 GB memory
>>>   - 5 x 300GB HDD
>>>   - 2 x SAS HBA
>>>   - 4 x 10GbE networking card
>>>
>>> 1 x SuperMicro chassis with 17 x HP Enterprise 400GB SSD and 17 x HP
>>> Enterprise 1.7TB HDD
>>> Chassis and HP server are connected with 2 x SAS HBA for redundancy.
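As a quick sanity check on output like the 'ceph health detail' above, the per-bucket "ops are blocked" counts can be summed and compared against the REQUEST_SLOW total. A sketch over sample text matching the counts quoted above:

```shell
# Sum the 'N ops are blocked > T sec' buckets from 'ceph health detail'
# style output. The sample text mirrors the health detail quoted above.
health='REQUEST_SLOW 904 slow requests are blocked > 32 sec
364 ops are blocked > 1048.58 sec
212 ops are blocked > 524.288 sec
164 ops are blocked > 262.144 sec
100 ops are blocked > 131.072 sec
64 ops are blocked > 65.536 sec'
# Only lines that start with a count, so the summary line is not double-counted
total=$(printf '%s\n' "$health" | awk '/^[0-9]+ ops are blocked/ {sum += $1} END {print sum}')
echo "total blocked ops: $total"
```

Here the buckets sum to 904, matching the REQUEST_SLOW total, which confirms the buckets cover all slow requests rather than a subset.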
Re: [ceph-users] Blocked requests
Hello Matthew,

thanks for your feedback! Please clarify one point: do you mean that you recreated the pool as an erasure-coded one, or that you recreated it as a regular replicated one? In other words, do you now have an erasure-coded pool in production as a gnocchi backend?

In any case, given the instability you mention, experimenting with BlueStore looks like a better alternative.

Thanks again,
Fulvio

Original Message
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud <mattstr...@overstock.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>, Brian Andrus <brian.and...@dreamhost.com>
CC: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Date: 12/13/2017 5:05 PM

> We fixed it by destroying the pool and recreating it, though this isn't
> really a fix. Come to find out, Ceph has a weakness for small objects
> with a high change rate (the behavior that gnocchi displays). The
> cluster will keep going fine until an event (a reboot, OSD failure,
> etc.) happens. I haven't been able to find another solution. I have
> heard that BlueStore handles this better, but that wasn't stable on the
> release we are on.
>
> Thanks,
> Matthew Stroud
Re: [ceph-users] Blocked requests
We fixed it by destroying the pool and recreating it, though this isn't really a fix. Come to find out, Ceph has a weakness for small, high-change-rate objects (the behavior that gnocchi displays). The cluster will keep going fine until an event (a reboot, OSD failure, etc.) happens. I haven't been able to find another solution. I have heard that BlueStore handles this better, but that wasn't stable on the release we are on.

Thanks,
Matthew Stroud

On 12/13/17, 3:56 AM, "Fulvio Galeazzi" <fulvio.galea...@garr.it> wrote:
<snip>
Re: [ceph-users] Blocked requests
Hallo Matthew, I am now facing the same issue and found this message of yours. Were you eventually able to figure out what the problem is with erasure-coded pools? At first sight, the bugzilla page linked by Brian does not seem to specifically mention erasure-coded pools...
Thanks for your help
Fulvio

Original Message
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud <mattstr...@overstock.com>
To: Brian Andrus <brian.and...@dreamhost.com>
CC: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Date: 09/07/2017 11:01 PM
<snip>
Re: [ceph-users] Blocked requests
Is it this? https://bugzilla.redhat.com/show_bug.cgi?id=1430588

On Fri, Sep 8, 2017 at 7:01 AM, Matthew Stroud <mattstr...@overstock.com> wrote:
<snip>
Re: [ceph-users] Blocked requests
After some troubleshooting, the issues appear to be caused by gnocchi using rados. I'm trying to figure out why.

Thanks,
Matthew Stroud

From: Brian Andrus <brian.and...@dreamhost.com>
Date: Thursday, September 7, 2017 at 1:53 PM
To: Matthew Stroud <mattstr...@overstock.com>
Cc: David Turner <drakonst...@gmail.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests

"ceph osd blocked-by" can do the same thing as that provided script.

Can you post relevant osd.10 logs and a pg dump of an affected placement group? Specifically interested in the recovery_state section.

Hopefully you were careful in how you were rebooting OSDs, and did not reboot multiple in the same failure domain before recovery was able to occur.

On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud <mattstr...@overstock.com> wrote:

Here is the output of your snippet:

[root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
6 osd.10
52 ops are blocked > 4194.3 sec on osd.17
9 ops are blocked > 2097.15 sec on osd.10
4 ops are blocked > 1048.58 sec on osd.10
39 ops are blocked > 262.144 sec on osd.10
19 ops are blocked > 131.072 sec on osd.10
6 ops are blocked > 65.536 sec on osd.10
2 ops are blocked > 32.768 sec on osd.10

Here is some backfilling info:

[root@mon01 ceph-conf]# ceph status
    cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
     health HEALTH_WARN
            5 pgs backfilling
            5 pgs degraded
            5 pgs stuck degraded
            5 pgs stuck unclean
            5 pgs stuck undersized
            5 pgs undersized
            122 requests are blocked > 32 sec
            recovery 2361/1097929 objects degraded (0.215%)
            recovery 5578/1097929 objects misplaced (0.508%)
     monmap e1: 3 mons at {mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
            election epoch 58, quorum 0,1,2 mon01,mon02,mon03
     osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
            1005 GB used, 20283 GB / 21288 GB avail
            2361/1097929 objects degraded (0.215%)
            5578/1097929 objects misplaced (0.508%)
                2587 active+clean
                   5 active+undersized+degraded+remapped+backfilling

[root@mon01 ceph-conf]# ceph pg dump_stuck unclean
ok
pg_stat  state                                            up         up_primary  acting   acting_primary
3.5c2    active+undersized+degraded+remapped+backfilling  [17,2,10]  17          [17,2]   17
3.54a    active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
5.3b     active+undersized+degraded+remapped+backfilling  [3,19,0]   3           [10,17]  10
5.b3     active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
3.180    active+undersized+degraded+remapped+backfilling  [17,10,6]  17          [22,19]  22

Most of the backfilling was caused by restarting OSDs to clear blocked IO.

Here are some of the blocked IOs:

/var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10 10.20.57.15:6806/7029 9362 : cluster [WRN] slow request 60.834494 seconds old, received at 2017-09-07 13:28:36.143920: osd_op(client.114947.0:2039090 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10 10.20.57.15:6806/7029 9363 : cluster [WRN] slow request 240.661052 seconds old, received at 2017-09-07 13:25:36.317363: osd_op(client.246934107.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10 10.20.57.15:6806/7029 9364 : cluster [WRN] slow request 240.660763 seconds old, received at 2017-09-07 13:25:36.317651: osd_op(client.246944377.0:2 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978576 osd.10 10.20.57.15:6806/7029 9365 : cluster [WRN] slow request 240.660675 seconds old, received at 2017-09-07 13:25:36.317740: osd_op(client.246944377.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979367 osd.10 10.20.57.15:6806/7029 9366 : cluster [WRN] 72 slow requests, 3 included below; oldest blocked for > 1820.342287 secs
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979373 osd.10 10.20.57.15:6806/7029 9367 : cluster [WRN] slow request 30.606290 seconds old, received at 2017-
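Brian's request for the recovery_state section can also be inspected offline from a saved `ceph pg query` dump. A minimal sketch, using a trimmed, hypothetical sample of the query JSON (`state`, `blocked_by`, and `recovery_state` are fields of the real output; the values here are made up):

```shell
# Sketch: inspect the peering details of a saved `ceph pg query` dump
# offline. On a live cluster the dump would come from something like:
#   ceph pg 3.5c2 query > pg_query.json
# The JSON below is a trimmed, hypothetical sample.
cat > pg_query.json <<'EOF'
{
  "state": "active+undersized+degraded+remapped+backfilling",
  "up": [17, 2, 10],
  "acting": [17, 2],
  "blocked_by": [],
  "recovery_state": [
    { "name": "Started/Primary/Active",
      "enter_time": "2017-09-07 13:25:36.317363" }
  ]
}
EOF

# Pull out the PG state and any OSDs blocking it.
grep -E '"(state|blocked_by)"' pg_query.json
```

An empty `blocked_by` list here would mean the query offers no hint about which peer the PG is waiting on.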
Re: [ceph-users] Blocked requests
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979377 osd.10 10.20.57.15:6806/7029 9368 : cluster [WRN] slow request 30.554317 seconds old, received at 2017-09-07 13:29:12.424972: osd_op(client.115020.0:1831942 5.39f2d3b (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979383 osd.10 10.20.57.15:6806/7029 9369 : cluster [WRN] slow request 30.368086 seconds old, received at 2017-09-07 13:29:12.611204: osd_op(client.115014.0:73392774 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:43.979553 osd.10 10.20.57.15:6806/7029 9370 : cluster [WRN] 73 slow requests, 1 included below; oldest blocked for > 1821.342499 secs
/var/log/ceph/ceph.log:2017-09-07 13:29:43.979559 osd.10 10.20.57.15:6806/7029 9371 : cluster [WRN] slow request 30.452344 seconds old, received at 2017-09-07 13:29:13.527157: osd_op(client.115011.0:483954528 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg

From: David Turner <drakonst...@gmail.com>
Date: Thursday, September 7, 2017 at 1:17 PM
To: Matthew Stroud <mattstr...@overstock.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests
<snip>
Re: [ceph-users] Blocked requests
From: David Turner <drakonst...@gmail.com>
Date: Thursday, September 7, 2017 at 1:17 PM
To: Matthew Stroud <mattstr...@overstock.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests

I would recommend pushing forward with the update instead of rolling back. Ceph doesn't have a track record of rolling back to a previous version.

I don't have enough information to really make sense of the ceph health detail output. For example: are the OSDs listed all on the same host? Watching this output over time, are some of the requests clearing up? Are there any other patterns? I put the following in a script and run it in a watch command to try to follow patterns when I'm plagued with blocked requests.

output=$(ceph --cluster $cluster health detail | grep 'ops are blocked' | sort -nrk6 | sed 's/ ops/+ops/' | sed 's/ sec/+sec/' | column -t -s'+')
echo "$output" | grep -v 'on osd'
echo "$output" | grep -Eo osd.[0-9]+ | sort -n | uniq -c | grep -v ' 1 '
echo "$output" | grep 'on osd'

Why do you have backfilling? You haven't mentioned that you have any backfilling yet. Installing an update shouldn't cause backfilling, but it's likely related to your blocked requests.

On Thu, Sep 7, 2017 at 2:24 PM Matthew Stroud <mattstr...@overstock.com> wrote:
<snip>
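David's snippet can be dry-run without a live cluster by feeding its counting stage a canned `ceph health detail` excerpt. A sketch, with sample lines that are hypothetical but modeled on the output earlier in the thread:

```shell
# Sketch: the counting stage of the script above, applied to canned
# `ceph health detail` lines instead of a live `ceph health detail` call.
output=$(cat <<'EOF'
52 ops are blocked > 4194.3 sec on osd.17
9 ops are blocked > 2097.15 sec on osd.10
39 ops are blocked > 262.144 sec on osd.10
EOF
)

# Sort by blocked duration (field 6), longest first.
sorted=$(echo "$output" | grep 'ops are blocked' | sort -nrk6)

# OSDs appearing in more than one duration bucket are the likely suspects.
echo "$sorted" | grep -Eo 'osd\.[0-9]+' | sort | uniq -c | grep -v ' 1 '
```

With the sample above, only osd.10 survives the `grep -v ' 1 '` filter, which is exactly the "find the repeat offender" pattern David describes.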
Re: [ceph-users] Blocked requests
Well, in the meantime things have gone from bad to worse: now the cluster isn't rebuilding and clients are unable to pass IO to the cluster. When this first took place, we started rolling back to 10.2.7; though that was successful, it didn't help with the issue. Here is the command output:

HEALTH_WARN 39 pgs backfill_wait; 5 pgs backfilling; 43 pgs degraded; 43 pgs stuck degraded; 44 pgs stuck unclean; 43 pgs stuck undersized; 43 pgs undersized; 367 requests are blocked > 32 sec; 14 osds have slow requests; recovery 4678/1097738 objects degraded (0.426%); recovery 10364/1097738 objects misplaced (0.944%)
pg 3.624 is stuck unclean for 1402.022837, current state active+undersized+degraded+remapped+wait_backfill, last acting [12,9]
pg 3.587 is stuck unclean for 2536.693566, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,13]
pg 3.45f is stuck unclean for 1421.178244, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,10]
pg 3.41a is stuck unclean for 1505.091187, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,23]
pg 3.4cc is stuck unclean for 1560.824332, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,10]
<snip>
pg 3.188 is stuck degraded for 1207.118130, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,17]
pg 3.768 is stuck degraded for 1123.722910, current state active+undersized+degraded+remapped+wait_backfill, last acting [11,18]
pg 3.77c is stuck degraded for 1211.981606, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,2]
pg 3.7d1 is stuck degraded for 1074.422756, current state active+undersized+degraded+remapped+wait_backfill, last acting [10,12]
pg 3.7d1 is active+undersized+degraded+remapped+wait_backfill, acting [10,12]
pg 3.77c is active+undersized+degraded+remapped+wait_backfill, acting [9,2]
pg 3.768 is active+undersized+degraded+remapped+wait_backfill, acting [11,18]
pg 3.709 is active+undersized+degraded+remapped+wait_backfill, acting [10,4]
pg 3.5d8 is active+undersized+degraded+remapped+wait_backfill, acting [2,10]
pg 3.5dc is active+undersized+degraded+remapped+wait_backfill, acting [8,19]
pg 3.5f8 is active+undersized+degraded+remapped+wait_backfill, acting [2,21]
pg 3.624 is active+undersized+degraded+remapped+wait_backfill, acting [12,9]
2 ops are blocked > 1048.58 sec on osd.9
3 ops are blocked > 65.536 sec on osd.9
7 ops are blocked > 1048.58 sec on osd.8
1 ops are blocked > 524.288 sec on osd.8
1 ops are blocked > 131.072 sec on osd.8
1 ops are blocked > 524.288 sec on osd.2
1 ops are blocked > 262.144 sec on osd.2
2 ops are blocked > 65.536 sec on osd.21
9 ops are blocked > 1048.58 sec on osd.5
9 ops are blocked > 524.288 sec on osd.5
71 ops are blocked > 131.072 sec on osd.5
19 ops are blocked > 65.536 sec on osd.5
35 ops are blocked > 32.768 sec on osd.5
14 osds have slow requests
recovery 4678/1097738 objects degraded (0.426%)
recovery 10364/1097738 objects misplaced (0.944%)

From: David Turner <drakonst...@gmail.com>
Date: Thursday, September 7, 2017 at 11:33 AM
To: Matthew Stroud <mattstr...@overstock.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests
<snip>
Re: [ceph-users] Blocked requests
To be fair, other times I have to go in and tweak configuration settings and timings to resolve chronic blocked requests.

On Thu, Sep 7, 2017 at 1:32 PM David Turner wrote:
<snip>
Re: [ceph-users] Blocked requests
`ceph health detail` will give a little more information into the blocked requests. Specifically which OSDs are the requests blocked on and how long have they actually been blocked (as opposed to '> 32 sec'). I usually find a pattern after watching that for a time and narrow things down to an OSD, journal, etc. Sometimes I just need to restart a specific OSD and all is well. On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud wrote: > After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests > for ‘currently waiting for missing object’. I have tried bouncing the osds > and rebooting the osd nodes, but that just moves the problems around. > Previous to this upgrade we had no issues. Any ideas of what to look at? > > > > Thanks, > > Matthew Stroud > > -- > > CONFIDENTIALITY NOTICE: This message is intended only for the use and > review of the individual or entity to which it is addressed and may contain > information that is privileged and confidential. If the reader of this > message is not the intended recipient, or the employee or agent responsible > for delivering the message solely to the intended recipient, you are hereby > notified that any dissemination, distribution or copying of this > communication is strictly prohibited. If you have received this > communication in error, please notify sender immediately by telephone or > return email. Thank you. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
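David's narrowing-down step can be sketched as a small shell helper. This is an illustrative script of our own, not a Ceph tool; on a live cluster you would pipe real `ceph health detail` output into it instead of the sample quoted later in this thread.

```shell
# Summarize blocked ops per OSD from `ceph health detail` output.
# On a live cluster: ceph health detail | blocked_by_osd
blocked_by_osd() {
    # Lines look like: "N ops are blocked > T sec on osd.X"
    awk '/ops are blocked .* on osd\./ { n[$NF] += $1 } END { for (o in n) print o, n[o] }' | sort
}

# Sample taken from the health output quoted later in this thread:
sample='1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
16 ops are blocked > 134218 sec on osd.29
11 ops are blocked > 67108.9 sec on osd.29
2 ops are blocked > 16777.2 sec on osd.29
1 ops are blocked > 8388.61 sec on osd.29'

printf '%s\n' "$sample" | blocked_by_osd
```

With this sample, osd.29 accumulates 30 blocked ops and immediately stands out as the one to investigate or restart.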
[ceph-users] Blocked requests
After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests for ‘currently waiting for missing object’. I have tried bouncing the osds and rebooting the osd nodes, but that just moves the problems around. Previous to this upgrade we had no issues. Any ideas of what to look at? Thanks, Matthew Stroud CONFIDENTIALITY NOTICE: This message is intended only for the use and review of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering the message solely to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately by telephone or return email. Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests problem
Finally, problem solved. First, I set the noscrub, nodeep-scrub, norebalance, nobackfill, norecover, noup and nodown flags. Then I restarted the OSD which had the problem. When the OSD daemon started, blocked requests increased (up to 100) and some misplaced PGs appeared. Then I unset the flags in this order: noup, nodown, norecover, nobackfill, norebalance. In a little while, all misplaced PGs were repaired. Then I unset the noscrub and nodeep-scrub flags. And finally: HEALTH_OK. Thanks for your help, Ramazan > On 22 Aug 2017, at 20:46, Ranjan Ghosh wrote: > > Hm. That's quite weird. On our cluster, when I set "noscrub", "nodeep-scrub", > scrubbing will always stop pretty quickly (a few minutes). I wonder why this > doesn't happen on your cluster. When exactly did you set the flag? Perhaps it > just needs some more time... Or there might be a disk problem why the > scrubbing never finishes. Perhaps it's really a good idea, just like you > proposed, to shut down the corresponding OSDs. But that's just my thoughts. > Perhaps some Ceph pro can shed some light on the possible reasons why a > scrubbing might get stuck and how to resolve this. > > > Am 22.08.2017 um 18:58 schrieb Ramazan Terzi: >> Hi Ranjan, >> >> Thanks for your reply. I did set the noscrub and nodeep-scrub flags. But the active >> scrubbing operation isn't working properly. The scrubbing operation is always on the >> same PG (20.1e). 
>> >> $ ceph pg dump | grep scrub >> dumped all in format plain >> pg_stat objects mip degr misp unf bytes log disklog >> state state_stamp v reported up up_primary >> acting acting_primary last_scrub scrub_stamp last_deep_scrub >> deep_scrub_stamp >> 20.1e 25189 0 0 0 0 98359116362 3048 3048 >> active+clean+scrubbing 2017-08-21 04:55:13.354379 >> 6930'2393 6930:20949058 [29,31,3] 29 [29,31,3] 29 >> 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 >> 2017-08-20 04:46:59.208792 >> >> >> $ ceph -s >> cluster >> health HEALTH_WARN >> 33 requests are blocked > 32 sec >> noscrub,nodeep-scrub flag(s) set >> monmap e9: 3 mons at >> {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0} >> election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 >> osdmap e6930: 36 osds: 36 up, 36 in >> flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds >> pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects >> 70497 GB used, 127 TB / 196 TB avail >> 1407 active+clean >> 1 active+clean+scrubbing >> >> >> Thanks, >> Ramazan >> >> >>> On 22 Aug 2017, at 18:52, Ranjan Ghosh wrote: >>> >>> Hi Ramazan, >>> >>> I'm no Ceph expert, but what I can say from my experience using Ceph is: >>> >>> 1) During "Scrubbing", Ceph can be extremely slow. This is probably where >>> your "blocked requests" are coming from. BTW: Perhaps you can even find out >>> which processes are currently blocking with: ps aux | grep "D". You might >>> even want to kill some of those and/or shutdown services in order to >>> relieve some stress from the machine until it recovers. >>> >>> 2) I usually have the following in my ceph.conf. This lets the scrubbing >>> only run between midnight and 6 AM (hopefully the time of least demand; >>> adjust as necessary) - and with the lowest priority. >>> >>> #Reduce impact of scrub. >>> osd_disk_thread_ioprio_priority = 7 >>> osd_disk_thread_ioprio_class = "idle" >>> osd_scrub_end_hour = 6 >>> >>> 3) The Scrubbing begin and end hour will always work. 
The low priority >>> mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your >>> current scheduler like this (replace sda with your device): >>> >>> cat /sys/block/sda/queue/scheduler >>> >>> You can also echo to this file to set a different scheduler. >>> >>> >>> With these settings you can perhaps alleviate the problem so that the >>> scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't >>> have to finish in one night. It will continue the next night and so on. >>> >>> The Ceph experts say scrubbing is important. Don't know why, but I just >>> believe them. They've built this complex stuff after all :-) >>> >>> Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back >>> to work, but you should not let it run like this forever and a day. >>> >>> Hope this helps at least a bit. >>> >>> BR, >>> >>> Ranjan >>> >>> >>> Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default
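Ramazan's recovery sequence above can be written out as a small shell runbook. The `run` wrapper and `DRYRUN` guard are our own scaffolding (printed here instead of executed, since a real run needs a cluster and an admin keyring); the flag order is the one he reported.

```shell
# Sketch of the recovery sequence above. DRYRUN=1 only prints the
# commands; unset it to run the real `ceph` CLI on an actual cluster.
DRYRUN=1
run() { if [ "${DRYRUN:-0}" = "1" ]; then echo "ceph $*"; else ceph "$@"; fi; }

set_flags="noscrub nodeep-scrub norebalance nobackfill norecover noup nodown"
unset_flags="noup nodown norecover nobackfill norebalance"   # order as reported

for f in $set_flags; do run osd set "$f"; done
# ... restart the problem OSD daemon here, e.g. via systemctl ...
for f in $unset_flags; do run osd unset "$f"; done
# Once the misplaced PGs have recovered:
run osd unset noscrub
run osd unset nodeep-scrub
```

The point of the ordering is that recovery-related flags come off before the scrub flags, so backfill can repair the misplaced PGs before scrubbing is allowed to resume.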
Re: [ceph-users] Blocked requests problem
Hi, Sometimes we have the same issue on our 10.2.9 cluster (24 nodes with 60 OSDs each). I think there is some race condition or something like that which results in this state. The blocked requests start exactly at the time the PG begins to scrub. You can try the following; the OSD will automatically recover and the blocked requests will disappear: ceph osd down 31 In my opinion this is a bug, but I have not investigated so far. Maybe some developer can say something about this issue. Regards, Manuel Am Tue, 22 Aug 2017 16:20:14 +0300 schrieb Ramazan Terzi: > Hello, > > I have a Ceph Cluster with specifications below: > 3 x Monitor node > 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks > have SSD journals) Distributed public and private networks. All NICs > are 10Gbit/s osd pool default size = 3 > osd pool default min size = 2 > > Ceph version is Jewel 10.2.6. > > My cluster is active and a lot of virtual machines are running on it > (Linux and Windows VM's, database clusters, web servers etc). > > During normal use, the cluster slowly went into a state of blocked > requests. Blocked requests periodically incrementing. All OSDs seem > healthy. Benchmark, iowait, network tests, all of them succeed. 
> > Yesterday, 08:00: > $ ceph health detail > HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests > 1 ops are blocked > 134218 sec on osd.31 > 1 ops are blocked > 134218 sec on osd.3 > 1 ops are blocked > 8388.61 sec on osd.29 > 3 osds have slow requests > > Today, 16:05: > $ ceph health detail > HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow > requests 1 ops are blocked > 134218 sec on osd.31 > 1 ops are blocked > 134218 sec on osd.3 > 16 ops are blocked > 134218 sec on osd.29 > 11 ops are blocked > 67108.9 sec on osd.29 > 2 ops are blocked > 16777.2 sec on osd.29 > 1 ops are blocked > 8388.61 sec on osd.29 > 3 osds have slow requests > > $ ceph pg dump | grep scrub > dumped all in format plain > pg_stat objects mip degr misp > unf bytes log disklog state > state_stamp v reported up > up_primary acting acting_primary > last_scrub scrub_stamp last_deep_scrub > deep_scrub_stamp > 20.1e 25183 0 0 0 > 0 98332537930 3066 3066 > active+clean+scrubbing 2017-08-21 04:55:13.354379 > 6930'23908781 6930:20905696 [29,31,3] 29 > [29,31,3] 29 6712'22950171 2017-08-20 > 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 > > Active scrub does not finish (about 24 hours). I did not restart any > OSD meanwhile. I'm thinking of setting the noscrub, nodeep-scrub, norebalance, > nobackfill, and norecover flags and restarting OSDs 3, 29 and 31. Will this > solve my problem? Or does anyone have a suggestion about this problem? > > Thanks, > Ramazan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Manuel Lausch Systemadministrator Cloud Services 1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 Karlsruhe | Germany Phone: +49 721 91374-1847 E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de Amtsgericht Montabaur, HRB 5452 Geschäftsführer: Thomas Ludwig, Jan Oetjen Member of United Internet This e-mail may contain confidential and/or privileged information. If you are not the intended recipient of this e-mail, you are hereby notified that saving, distribution or use of the content of this e-mail in any way is prohibited. If you have received this e-mail in error, please notify the sender and delete the e-mail. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests problem
Hm. That's quite weird. On our cluster, when I set "noscrub", "nodeep-scrub", scrubbing will always stop pretty quickly (a few minutes). I wonder why this doesn't happen on your cluster. When exactly did you set the flag? Perhaps it just needs some more time... Or there might be a disk problem why the scrubbing never finishes. Perhaps it's really a good idea, just like you proposed, to shut down the corresponding OSDs. But that's just my thoughts. Perhaps some Ceph pro can shed some light on the possible reasons why a scrubbing might get stuck and how to resolve this. Am 22.08.2017 um 18:58 schrieb Ramazan Terzi: Hi Ranjan, Thanks for your reply. I did set the noscrub and nodeep-scrub flags. But the active scrubbing operation isn't working properly. The scrubbing operation is always on the same PG (20.1e). $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25189 0 0 0 0 98359116362 3048 3048 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'2393 6930:20949058 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 $ ceph -s cluster health HEALTH_WARN 33 requests are blocked > 32 sec noscrub,nodeep-scrub flag(s) set monmap e9: 3 mons at {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0} election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e6930: 36 osds: 36 up, 36 in flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects 70497 GB used, 127 TB / 196 TB avail 1407 active+clean 1 active+clean+scrubbing Thanks, Ramazan On 22 Aug 2017, at 18:52, Ranjan Ghosh wrote: Hi Ramazan, I'm no Ceph expert, but what I can say from my experience using Ceph is: 1) During "Scrubbing", Ceph can be extremely slow. This is probably where your "blocked requests" are coming from. 
BTW: Perhaps you can even find out which processes are currently blocking with: ps aux | grep "D". You might even want to kill some of those and/or shut down services in order to relieve some stress from the machine until it recovers. 2) I usually have the following in my ceph.conf. This lets the scrubbing only run between midnight and 6 AM (hopefully the time of least demand; adjust as necessary) - and with the lowest priority. #Reduce impact of scrub. osd_disk_thread_ioprio_priority = 7 osd_disk_thread_ioprio_class = "idle" osd_scrub_end_hour = 6 3) The Scrubbing begin and end hour will always work. The low priority mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current scheduler like this (replace sda with your device): cat /sys/block/sda/queue/scheduler You can also echo to this file to set a different scheduler. With these settings you can perhaps alleviate the problem so that the scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't have to finish in one night. It will continue the next night and so on. The Ceph experts say scrubbing is important. Don't know why, but I just believe them. They've built this complex stuff after all :-) Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back to work, but you should not let it run like this forever and a day. Hope this helps at least a bit. BR, Ranjan Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default min size = 2 Ceph version is Jewel 10.2.6. My cluster is active and a lot of virtual machines are running on it (Linux and Windows VM's, database clusters, web servers etc). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests periodically incrementing. 
All OSDs seem healthy. Benchmark, iowait, network tests, all of them succeed. Yesterday, 08:00: $ ceph health detail HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests Today, 16:05: $ ceph health detail HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 16 ops are blocked > 134218 sec on osd.29 11 ops are blocked > 67108.9 sec on osd.29 2 ops are blocked > 16777.2
Re: [ceph-users] Blocked requests problem
Hi Ranjan, Thanks for your reply. I did set the noscrub and nodeep-scrub flags. But the active scrubbing operation isn't working properly. The scrubbing operation is always on the same PG (20.1e). $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25189 0 0 0 0 98359116362 3048 3048 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'2393 6930:20949058 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 $ ceph -s cluster health HEALTH_WARN 33 requests are blocked > 32 sec noscrub,nodeep-scrub flag(s) set monmap e9: 3 mons at {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0} election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e6930: 36 osds: 36 up, 36 in flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects 70497 GB used, 127 TB / 196 TB avail 1407 active+clean 1 active+clean+scrubbing Thanks, Ramazan > On 22 Aug 2017, at 18:52, Ranjan Ghosh wrote: > > Hi Ramazan, > > I'm no Ceph expert, but what I can say from my experience using Ceph is: > > 1) During "Scrubbing", Ceph can be extremely slow. This is probably where > your "blocked requests" are coming from. BTW: Perhaps you can even find out > which processes are currently blocking with: ps aux | grep "D". You might > even want to kill some of those and/or shut down services in order to relieve > some stress from the machine until it recovers. > > 2) I usually have the following in my ceph.conf. This lets the scrubbing only > run between midnight and 6 AM (hopefully the time of least demand; adjust as > necessary) - and with the lowest priority. > > #Reduce impact of scrub. 
> osd_disk_thread_ioprio_priority = 7 > osd_disk_thread_ioprio_class = "idle" > osd_scrub_end_hour = 6 > > 3) The Scrubbing begin and end hour will always work. The low priority mode, > however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current > scheduler like this (replace sda with your device): > > cat /sys/block/sda/queue/scheduler > > You can also echo to this file to set a different scheduler. > > > With these settings you can perhaps alleviate the problem so that the > scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't > have to finish in one night. It will continue the next night and so on. > > The Ceph experts say scrubbing is important. Don't know why, but I just > believe them. They've built this complex stuff after all :-) > > Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back > to work, but you should not let it run like this forever and a day. > > Hope this helps at least a bit. > > BR, > > Ranjan > > > Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: >> Hello, >> >> I have a Ceph Cluster with specifications below: >> 3 x Monitor node >> 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have >> SSD journals) >> Distributed public and private networks. All NICs are 10Gbit/s >> osd pool default size = 3 >> osd pool default min size = 2 >> >> Ceph version is Jewel 10.2.6. >> >> My cluster is active and a lot of virtual machines are running on it (Linux and >> Windows VM's, database clusters, web servers etc). >> >> During normal use, the cluster slowly went into a state of blocked requests. >> Blocked requests periodically incrementing. All OSDs seem healthy. >> Benchmark, iowait, network tests, all of them succeed. 
>> >> Yesterday, 08:00: >> $ ceph health detail >> HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests >> 1 ops are blocked > 134218 sec on osd.31 >> 1 ops are blocked > 134218 sec on osd.3 >> 1 ops are blocked > 8388.61 sec on osd.29 >> 3 osds have slow requests >> >> Today, 16:05: >> $ ceph health detail >> HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests >> 1 ops are blocked > 134218 sec on osd.31 >> 1 ops are blocked > 134218 sec on osd.3 >> 16 ops are blocked > 134218 sec on osd.29 >> 11 ops are blocked > 67108.9 sec on osd.29 >> 2 ops are blocked > 16777.2 sec on osd.29 >> 1 ops are blocked > 8388.61 sec on osd.29 >> 3 osds have slow requests >> >> $ ceph pg dump | grep scrub >> dumped all in format plain >> pg_stat objects mip degr misp unf bytes log disklog >> state state_stamp v reported up up_primary >> acting acting_primary last_scrub scrub_stamp last_deep_scrub >> deep_scrub_stamp >> 20.1e 25183 0 0
Re: [ceph-users] Blocked requests problem
Hi Ramazan, I'm no Ceph expert, but what I can say from my experience using Ceph is: 1) During "Scrubbing", Ceph can be extremely slow. This is probably where your "blocked requests" are coming from. BTW: Perhaps you can even find out which processes are currently blocking with: ps aux | grep "D". You might even want to kill some of those and/or shut down services in order to relieve some stress from the machine until it recovers. 2) I usually have the following in my ceph.conf. This lets the scrubbing only run between midnight and 6 AM (hopefully the time of least demand; adjust as necessary) - and with the lowest priority. #Reduce impact of scrub. osd_disk_thread_ioprio_priority = 7 osd_disk_thread_ioprio_class = "idle" osd_scrub_end_hour = 6 3) The Scrubbing begin and end hour will always work. The low priority mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current scheduler like this (replace sda with your device): cat /sys/block/sda/queue/scheduler You can also echo to this file to set a different scheduler. With these settings you can perhaps alleviate the problem so that the scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't have to finish in one night. It will continue the next night and so on. The Ceph experts say scrubbing is important. Don't know why, but I just believe them. They've built this complex stuff after all :-) Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back to work, but you should not let it run like this forever and a day. Hope this helps at least a bit. BR, Ranjan Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default min size = 2 Ceph version is Jewel 10.2.6. 
My cluster is active and a lot of virtual machines are running on it (Linux and Windows VM's, database clusters, web servers etc). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests periodically incrementing. All OSDs seem healthy. Benchmark, iowait, network tests, all of them succeed. Yesterday, 08:00: $ ceph health detail HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests Today, 16:05: $ ceph health detail HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 16 ops are blocked > 134218 sec on osd.29 11 ops are blocked > 67108.9 sec on osd.29 2 ops are blocked > 16777.2 sec on osd.29 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25183 0 0 0 0 98332537930 3066 3066 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'23908781 6930:20905696 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 Active scrub does not finish (about 24 hours). I did not restart any OSD meanwhile. I'm thinking of setting the noscrub, nodeep-scrub, norebalance, nobackfill, and norecover flags and restarting OSDs 3, 29 and 31. Will this solve my problem? Or does anyone have a suggestion about this problem? Thanks, Ramazan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
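Ranjan's CFQ caveat can be checked mechanically. The `active_scheduler` helper below is our own (not a Ceph or kernel tool); the sysfs format it parses, with the active scheduler shown in brackets, is standard Linux block-layer behaviour, simulated here with a temporary file so the sketch runs anywhere.

```shell
# Extract the active I/O scheduler from a /sys/block/<dev>/queue/scheduler
# style file, e.g. "noop deadline [cfq]" -> "cfq".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Simulated sysfs file for illustration; on a real node you would pass
# /sys/block/sda/queue/scheduler (replace sda with your device).
tmp=$(mktemp)
echo 'noop deadline [cfq]' > "$tmp"
if [ "$(active_scheduler "$tmp")" = "cfq" ]; then
    echo "cfq active: osd_disk_thread_ioprio_* settings will apply"
fi
rm -f "$tmp"
```

If the bracketed entry is anything other than cfq, the ioprio settings quoted above are silently ignored, which is worth ruling out before blaming the scrub window.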
[ceph-users] Blocked requests problem
Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default min size = 2 Ceph version is Jewel 10.2.6. My cluster is active and a lot of virtual machines are running on it (Linux and Windows VM's, database clusters, web servers etc). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests periodically incrementing. All OSDs seem healthy. Benchmark, iowait, network tests, all of them succeed. Yesterday, 08:00: $ ceph health detail HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests Today, 16:05: $ ceph health detail HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 16 ops are blocked > 134218 sec on osd.29 11 ops are blocked > 67108.9 sec on osd.29 2 ops are blocked > 16777.2 sec on osd.29 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25183 0 0 0 0 98332537930 3066 3066 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'23908781 6930:20905696 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 Active scrub does not finish (about 24 hours). I did not restart any OSD meanwhile. I'm thinking of setting the noscrub, nodeep-scrub, norebalance, nobackfill, and norecover flags and restarting OSDs 3, 29 and 31. Will this solve my problem? Or does anyone have a suggestion about this problem? 
Thanks, Ramazan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Am 10.12.2015 um 06:38 schrieb Robert LeBlanc: > Since I'm very interested in > reducing this problem, I'm willing to try and submit a fix after I'm > done with the new OP queue I'm working on. I don't know the best > course of action at the moment, but I hope I can get some input for > when I do try and tackle the problem next year. Is there already a ticket present for this issue in the bug tracker? I think this is an important issue. Regards Christian -- Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Am 10.12.2015 um 06:38 schrieb Robert LeBlanc: > I noticed this a while back and did some tracing. As soon as the PGs > are read in by the OSD (very limited amount of housekeeping done), the > OSD is set to the "in" state so that peering with other OSDs can > happen and the recovery process can begin. The problem is that when > the OSD is "in", the clients also see that and start sending requests > to the OSDs before it has had a chance to actually get its bearings > and is able to even service the requests. After discussion with some > of the developers, there is no easy way around this other than let the > PGs recover to other OSDs and then bring in the OSDs after recovery (a > ton of data movement). Many thanks for your detailed analysis. It's a bit disappointing that there seems to be no easy way around. Any work to improve the situation is much appreciated. In the meantime, I'll be experimenting with pre-seeding the VFS cache to speed things up at least a little bit. Regards Christian -- Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Are you seeing "peering" PGs when the blocked requests are happening? That's what we see regularly when starting OSDs. I'm not sure this can be solved completely (and whether there are major improvements in newer Ceph versions), but it can be sped up by 1) making sure you have free (and not dirtied or fragmented) memory on the node where you are starting the OSD - that means dropping caches before starting the OSD if you have lots of "free" RAM that is used for VFS cache 2) starting the OSDs one by one instead of booting several of them 3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found it to be best to pin the OSD to a cgroup limited to one NUMA node and then limit it to a subset of cores after it has run a bit. OSD tends to use hundreds of % of CPU when booting 4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd... It's unclear to me whether MONs influence this somehow (the peering stage) but I have observed their CPU usage and IO also spikes when OSDs are started, so make sure they are not under load. Jan > On 09 Dec 2015, at 11:03, Christian Kauhaus wrote: > > Hi, > > I'm getting blocked requests (>30s) every time when an OSD is set to "in" in > our clusters. Once this has happened, backfills run smoothly. > > I have currently no idea where to start debugging. Has anyone a hint what to > examine first in order to narrow this issue? > > TIA > > Christian > > -- > Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 > Flying Circus Internet Operations GmbH · http://flyingcircus.io > Forsterstraße 29 · 06112 Halle (Saale) · Deutschland > HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
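Jan's points 1) and 2) can be sketched as a staged start script. Everything here is illustrative scaffolding of ours: `PG_STATE_CMD` is an injection point so the wait loop can be exercised without a cluster (on a real node it would simply be `ceph pg stat`), and the systemctl command is only echoed rather than run.

```shell
# Staged OSD start: drop the VFS cache first, then start OSDs one at a
# time, waiting until no PGs report peering/activating before the next.
wait_for_peering() {
    while ${PG_STATE_CMD:-ceph pg stat} | grep -Eq 'peering|activating'; do
        sleep 5
    done
}

start_osds_one_by_one() {
    # As root, before the first OSD (Jan's point 1):
    #   echo 3 > /proc/sys/vm/drop_caches
    for id in "$@"; do
        echo "would run: systemctl start ceph-osd@$id"
        wait_for_peering
    done
}

# Demonstration against a faked, already-clean cluster state:
fake_pg_stat() { echo '1800 pgs: 1800 active+clean'; }
PG_STATE_CMD=fake_pg_stat
start_osds_one_by_one 0 1 2
```

Gating each start on the peering/activating count going to zero is what keeps the window of blocked client requests short, rather than stacking several peering storms on top of each other.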
[ceph-users] Blocked requests after "osd in"
Hi, I'm getting blocked requests (>30s) every time when an OSD is set to "in" in our clusters. Once this has happened, backfills run smoothly. I have currently no idea where to start debugging. Has anyone a hint what to examine first in order to narrow this issue? TIA Christian -- Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Am 09.12.2015 um 11:21 schrieb Jan Schermer: > Are you seeing "peering" PGs when the blocked requests are happening? That's > what we see regularly when starting OSDs. Mostly "peering" and "activating". > I'm not sure this can be solved completely (and whether there are major > improvements in newer Ceph versions), but it can be sped up by > 1) making sure you have free (and not dirtied or fragmented) memory on the > node where you are starting the OSD > - that means dropping caches before starting the OSD if you have lots > of "free" RAM that is used for VFS cache > 2) starting the OSDs one by one instead of booting several of them > 3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found > it to be best to pin the OSD to a cgroup limited to one NUMA node and then > limit it to a subset of cores after it has run a bit. OSD tends to use > hundreds of % of CPU when booting > 4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd... Thank you for your advice. The use case is not so much after rebooting a server, but more when we take OSDs in/out for maintenance. During boot, we already start them one after another with 10s pause between each pair. I've done a bit of tracing. I've kept a small cluster running with 2 "in" OSDs out of 3 and put the third one "in" at 15:06:22. From ceph.log: | 2015-12-09 15:06:22.827030 mon.0 172.20.4.6:6789/0 54964 : cluster [INF] osdmap e264345: 3 osds: 3 up, 3 in | 2015-12-09 15:06:22.828693 mon.0 172.20.4.6:6789/0 54965 : cluster [INF] pgmap v39871295: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 4515 GB / 5421 GB avail; 6406 B/s rd, 889 kB/s wr, 67 op/s | [...] 
| 2015-12-09 15:06:29.163793 mon.0 172.20.4.6:6789/0 54972 : cluster [INF] pgmap v39871299: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 7700 GB / 8607 GB avail After a few seconds, backfills start as expected: | 2015-12-09 15:06:24.853507 osd.3 172.20.4.40:6800/5072 778 : cluster [INF] 410.c9 restarting backfill on osd.2 from (0'0,0'0] MAX to 264336'502426 | [...] | 2015-12-09 15:06:29.874092 osd.3 172.20.4.40:6800/5072 1308 : cluster [INF] 410.d1 restarting backfill on osd.2 from (0'0,0'0] MAX to 264344'1202983 | 2015-12-09 15:06:32.584907 mon.0 172.20.4.6:6789/0 54973 : cluster [INF] pgmap v39871300: 1800 pgs: 3 active+remapped+wait_backfill, 191 active+remapped, 1169 active+clean, 437 activating+remapped; 439 GB data, 906 GB used, 7700 GB / 8607 GB avail; 1725 kB/s rd, 2486 kB/s wr, 605 op/s; 23058/278796 objects misplaced (8.271%); 56612 kB/s, 14 objects/s recovering | 2015-12-09 15:06:24.851307 osd.0 172.20.4.51:6800/4919 2662 : cluster [INF] 410.c8 restarting backfill on osd.2 from (0'0,0'0] MAX to 264344'1017219 | 2015-12-09 15:06:38.555243 mon.0 172.20.4.6:6789/0 54976 : cluster [INF] pgmap v39871303: 1800 pgs: 22 active+remapped+wait_backfill, 520 active+remapped, 638 active+clean, 620 activating+remapped; 439 GB data, 906 GB used, 7700 GB / 8607 | GB avail; 45289 B/s wr, 4 op/s; 64014/313904 objects misplaced (20.393%) | 2015-12-09 15:06:38.133376 osd.3 172.20.4.40:6800/5072 1309 : cluster [WRN] 9 slow requests, 9 included below; oldest blocked for > 15.306541 secs | 2015-12-09 15:06:38.133385 osd.3 172.20.4.40:6800/5072 1310 : cluster [WRN] slow request 15.305213 seconds old, received at 2015-12-09 15:06:22.828061: osd_op(client.15205073.0:35726 rbd_header.13998a74b0dc51 [watch reconnect cookie 139897352489152 gen 37] 410.937870ca ondisk+write+known_if_redirected e264345) currently reached_pg It seems that PGs in "activating" state are causing blocked requests. 
After a half minute or so, slow requests disappear and backfill proceeds normally: | 2015-12-09 15:06:54.139948 osd.3 172.20.4.40:6800/5072 1396 : cluster [WRN] 42 slow requests, 9 included below; oldest blocked for > 31.188267 secs | 2015-12-09 15:06:54.139957 osd.3 172.20.4.40:6800/5072 1397 : cluster [WRN] slow request 15.566440 seconds old, received at 2015-12-09 15:06:38.573403: osd_op(client.15165527.0:5878994 rbd_data.129a42ae8944a.0f2b [set-alloc-hint object_size 4194304 write_size 4194304,write 1728512~4096] 410.de3ce70d snapc 3fd2=[3fd2] ack+ondisk+write+known_if_redirected e264348) currently waiting for subops from 0,2 | 2015-12-09 15:06:54.139977 osd.3 172.20.4.40:6800/5072 1401 : cluster [WRN] slow request 15.356852 seconds old, received at 2015-12-09 15:06:38.782990: osd_op(client.15165527.0:5878997 rbd_data.129a42ae8944a.0f2b [set-alloc-hint object_size 4194304 write_size 4194304,write 1880064~4096] 410.de3ce70d snapc 3fd2=[3fd2] ack+ondisk+write+known_if_redirected e264348) currently waiting for subops from 0,2 | [...] | 2015-12-09 15:07:00.072403 mon.0 172.20.4.6:6789/0 54989 : cluster [INF] osdmap e264351: 3 osds: 3 up, 3 in | 2015-12-09 15:07:00.074536 mon.0 172.20.4.6:6789/0 54990 : cluster [INF] pgmap v39871313: 1800 pgs: 277 active+remapped+wait_backfill, 881 active+remapped, 4 active+remapped+backfilling, 638 active+clean;
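The slow-request log lines quoted above can be tallied mechanically to confirm which stage is blocking (here, `reached_pg` while PGs are activating vs. `waiting for subops` once backfill is underway). A minimal sketch, assuming the standard ceph.log "[WRN] slow request ... currently <state>" line format shown in this thread; the function name is made up for illustration:

```python
import re
from collections import Counter

def slow_request_states(log_lines):
    """Tally the trailing 'currently <state>' field of [WRN] slow request
    lines, to see which stage of request processing is blocking."""
    states = Counter()
    for line in log_lines:
        if "slow request" not in line:
            continue
        m = re.search(r"currently (.+?)\s*$", line)
        if m:
            states[m.group(1)] += 1
    return states

# Sample lines abbreviated from the log excerpts in this message
sample = [
    "2015-12-09 15:06:38.133385 osd.3 ... [WRN] slow request 15.305213 seconds old, "
    "received at 2015-12-09 15:06:22.828061: osd_op(...) currently reached_pg",
    "2015-12-09 15:06:54.139957 osd.3 ... [WRN] slow request 15.566440 seconds old, "
    "received at 2015-12-09 15:06:38.573403: osd_op(...) currently waiting for subops from 0,2",
]
print(slow_request_states(sample))
```

If most entries cluster on `reached_pg` right after the osdmap change, that supports the reading that PGs stuck in "activating" are what blocks the client ops.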
Re: [ceph-users] Blocked requests after "osd in"
I noticed this a while back and did some tracing. As soon as the PGs are read in by the OSD (very limited amount of housekeeping done), the OSD is set to the "in" state so that peering with other OSDs can happen and the recovery process can begin. The problem is that when the OSD is "in", the clients also see that and start sending requests to the OSDs before it has had a chance to actually get its bearings and is able to even service the requests. After discussion with some of the developers, there is no easy way around this other than letting the PGs recover to other OSDs and then bringing in the OSDs after recovery (a ton of data movement). I've suggested some options on how to work around this issue, but they all require a large amount of rework. Since I'm very interested in reducing this problem, I'm willing to try and submit a fix after I'm done with the new OP queue I'm working on. I don't know the best course of action at the moment, but I hope I can get some input for when I do try and tackle the problem next year.

1. Add a new state that allows OSDs to peer without client requests coming in (up -> in -> active). I'm not sure if other OSDs are seen as clients; I don't think so. I'm not sure if there would have to be some trickery to make the booting OSDs not be primary until all the PGs are read and ready for I/O (not necessarily recovered yet).

2. When a request comes in for a PG that is not ready, send the client a redirect message to use the primary in a previous map. I have a feeling this could be very messy and not very safe.

3. Proxy the OP on behalf of the client until the PGs are ready. The "other" OSD would have to understand that it is OK to do that write/read OP even though it is not the primary; this can be difficult to do safely.

Right now I'm leaning toward option #1.
When the new OSD boots, keep the previous primary running and the PG is in degraded mode until the new OSD has done all of its housekeeping and can service the IO effectively, then make a change to the CRUSH map to swap the primaries where needed. Any input and ideas from the devs would be helpful. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Dec 9, 2015 at 7:33 AM, Christian Kauhaus wrote: > Am 09.12.2015 um 11:21 schrieb Jan Schermer: >> Are you seeing "peering" PGs when the blocked requests are happening? That's >> what we see regularly when starting OSDs. > > Mostly "peering" and "activating". 
> >> I'm not sure this can be solved completely (and whether there are major >> improvements in newer Ceph versions), but it can be sped up by >> 1) making sure you have free (and not dirtied or fragmented) memory on the >> node where you are starting the OSD >> - that means dropping caches before starting the OSD if you have lots >> of "free" RAM that is used for VFS cache >> 2) starting the OSDs one by one instead of booting several of them >> 3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found >> it to be best to pin the OSD to a cgroup limited to one NUMA node and then >> limit it to a subset of cores after it has run a bit. OSD tends to use >> hundreds of % of CPU when booting >> 4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd... > > Thank you for your advice. The use case is not so much after rebooting a > server, but more when we take OSDs in/out for maintenance. During boot, we > already start them one after another with 10s pause between each pair. > > I've done a bit of tracing. I've kept a small cluster running with 2 "in" OSDs > out of 3 and put the third one "in" at 15:06:22. From ceph.log: > > | 2015-12-09 15:06:22.827030 mon.0 172.20.4.6:6789/0 54964 : cluster [INF] > osdmap e264345: 3 osds: 3 up, 3 in > | 2015-12-09 15:06:22.828693 mon.0 172.20.4.6:6789/0 54965 : cluster [INF] > pgmap v39871295: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 4515 > GB / 5421 GB avail; 6406 B/s rd, 889 kB/s wr, 67 op/s > | [...] > | 2015-12-09 15:06:29.163793 mon.0 172.20.4.6:6789/0 54972 :
Re: [ceph-users] Blocked requests/ops?
Hello, On Thu, 28 May 2015 12:05:03 +0200 Xavier Serrano wrote: On Thu May 28 11:22:52 2015, Christian Balzer wrote: We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). Definitely a good idea to test things out and get an idea what Ceph and your hardware can do. From my experience and reading this ML however I think your best bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 OSD HDDs. Currently cache-tiering is probably the worst use for those SSD resources, though the code and strategy is of course improving. I agree: in our particular environment, our tests also conclude that SSD journaling performs far better than cache-tiering, especially when cache becomes close to its capacity and data movement between cache and backing storage occurs frequently. Precisely. We also want to test if it is possible to use SSD disks as a transparent cache for the HDDs at system (Linux kernel) level, and how reliable/good it is. There are quite a number of threads about this here, some quite recent/current. They range from not worth it (i.e. about the same performance as journal SSDs) to xyz-cache destroyed my data, ate my babies and set the house on fire (i.e. massive reliability problems). Which is a pity, as in theory they look like a nice fit/addition to Ceph. Dedicated SSD pools may be a good fit depending on your use case. However I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than required by HDD OSDs or SSD journals/HDD OSDs systems. And you already have 20 OSDs in that box. Good point! We did not consider that, thanks for pointing it out. What CPUs do you have in those storage nodes anyway? Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo. We have only 1 CPU per osd node, so I'm afraid we have another potential bottleneck here. 
Oh dear, about 10GHz (that CPU is supposedly 2.4, but you may see the 2.5 because it already is in turbo mode) for 20 OSDs. Where the recommendation for HDD only OSDs is 1GHz. Fire up atop (large window so you can see all the details and devices) on one of your storage nodes. Then from a client (VM) run this: --- fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4M --iodepth=32 --- This should result in your disks (OSDs) getting busy to the point of 100% utilization, but your CPU to still have some idle (that's idle AND wait combined). If you change the blocksize to 4K (and just ctrl-c fio after 30 or so seconds) you should see a very different picture, with the CPU being much busier and the HDDs seeing less than 100% usage. That will become even more pronounced with faster HDDs and/or journal SSDs. And pure SSD clusters/pools are way above that in terms of CPU hunger. If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache-tiers and pure SSD pools and switch over things when you're comfortable with this and happy with the performance. However with 20 OSDs per node, you're likely to go from a being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? (disk, CPU, RAM, etc.). Tools like munin? Munin might work, I use collectd to gather all those values (and even more importantly all Ceph counters) and graphite to visualize it. For ad-hoc, on the spot analysis I really like atop (in a huge window), which will make it very clear what is going on. 
In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? I'm no expert for large (not even medium) clusters, so you'll have to research the archives and net (the CERN Ceph slide is nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point with your dense storage nodes: http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations Christian All you say is really interesting. Thanks for your valuable advice. We surely still have plenty of things to learn and test before going to production. As long as you have the time to test out things, you'll be fine. ^_^ Christian Thanks again for your
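The two kernel tunables that came up in this exchange could be captured in a sysctl.d fragment like the following. The file name is illustrative; fs.aio-max-nr = 262144 is the value Xavier reported needing, while the pid_max value is just a common "raise it far above the default" choice, not a tested recommendation:

```
# /etc/sysctl.d/90-ceph.conf -- illustrative values, adjust per cluster
fs.aio-max-nr = 262144      # needed to run ~20 AIO-backed OSDs per host
kernel.pid_max = 4194303    # dense OSD nodes spawn a great many threads
```

Apply with `sysctl --system` (or reboot) and verify with `sysctl fs.aio-max-nr kernel.pid_max`.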
Re: [ceph-users] Blocked requests/ops?
On Thu May 28 11:22:52 2015, Christian Balzer wrote: We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). Definitely a good idea to test things out and get an idea what Ceph and your hardware can do. From my experience and reading this ML however I think your best bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 OSD HDDs. Currently cache-tiering is probably the worst use for those SSD resources, though the code and strategy is of course improving. I agree: in our particular environment, our tests also conclude that SSD journaling performs far better than cache-tiering, especially when cache becomes close to its capacity and data movement between cache and backing storage occurs frequently. We also want to test if it is possible to use SSD disks as a transparent cache for the HDDs at system (Linux kernel) level, and how reliable/good it is. Dedicated SSD pools may be a good fit depending on your use case. However I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than required by HDD OSDs or SSD journals/HDD OSDs systems. And you already have 20 OSDs in that box. Good point! We did not consider that, thanks for pointing it out. What CPUs do you have in those storage nodes anyway? Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo. We have only 1 CPU per osd node, so I'm afraid we have another potential bottleneck here. If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache-tiers and pure SSD pools and switch over things when you're comfortable with this and happy with the performance. 
However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? (disk, CPU, RAM, etc.). Tools like munin? Munin might work, I use collectd to gather all those values (and even more importantly all Ceph counters) and graphite to visualize it. For ad-hoc, on the spot analysis I really like atop (in a huge window), which will make it very clear what is going on. In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? I'm no expert for large (not even medium) clusters, so you'll have to research the archives and net (the CERN Ceph slide is nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point with your dense storage nodes: http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations Christian All you say is really interesting. Thanks for your valuable advice. We surely still have plenty of things to learn and test before going to production. Thanks again for your time and help. Best regards, - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC
Re: [ceph-users] Blocked requests/ops?
Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Reducing I/O pressure caused by recovery and backfill undoubtedly helped improve cluster performance during recovery, that was expected. But we did not expect that recovery time stayed the same... The only explanation for this is that, during recovery, there are lots of operations that fail due to a timeout, are retried several times, etc. So if disks are the bottleneck, reducing such values may help as well in normal cluster operation (when propagating the replicas, for instance). And slow/blocked requests/ops do not occur (or at least, occur less frequently). Does this make sense to you? Any other thoughts? Thank you very much again for your time. 
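The throttling described above can be made persistent in ceph.conf; a sketch using the values from this message (these options can also be changed at runtime, e.g. with `ceph tell osd.* injectargs`, though defaults and exact behavior vary between Ceph releases):

```
[osd]
# reduce recovery/backfill concurrency so client IO is not starved;
# values taken from this thread, verify against your release's defaults
osd max backfills = 1
osd recovery max active = 1
```

The trade-off is longer nominal recovery time for much better client latency, although as reported above, wall-clock recovery time may barely change when the disks were the bottleneck anyway.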
- Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC
Re: [ceph-users] Blocked requests/ops?
Hello, On Wed May 27 21:20:49 2015, Christian Balzer wrote: Hello, On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote: Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. You should open a bug with that description and a way to reproduce things, even if only sometimes. Having slow disks instead of an overloaded network causing permanently blocked requests definitely shouldn't happen. I totally agree. I'll try to reproduce and definitely open a bug. I'll let you know. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). There are some sleep values for recovery and scrub as well, these help a LOT with loaded clusters, too. Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Ah yes, I was going to comment on your HDDs earlier. 
As Dan van der Ster at CERN will happily admit, using green, slow HDDs with Ceph (and no SSD journals) is a bad idea. You're likely to see a VAST improvement with even just 1 journal SSD (of sufficient speed and durability) for 10 of your HDDs, a 1:5 ratio would of course be better. We do have SSDs, but we are not using them right now. We have 4 SSD per osd host (24 SSD at the moment). SSD model is Intel DC S3700 (400 GB). We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? (disk, CPU, RAM, etc.). Tools like munin? In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? Thank you very much again for your time. Best regards, - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC BTW, if your monitors are just used for that function, 128GB is total and utter overkill. They will be fine with 16-32GB, your storage nodes will be much better served (pagecache for hot read objects) with more RAM. And with 20 OSDs per node 32GB is pretty close to the minimum I'd recommend anyway. Reducing I/O pressure caused by recovery and backfill undoubtedly helped improve cluster performance during recovery, that was expected. But we did not expect that recovery time stayed the same... 
The only explanation for this is that, during recovery, there are lots of operations that fail due to a timeout, are retried several times, etc. So if disks are the bottleneck, reducing such values may help as well in normal cluster operation (when propagating the replicas, for instance). And slow/blocked requests/ops do not occur (or at least, occur less frequently). Does this make sense to you? Any other thoughts? Very much so, see above for more thoughts. Christian
Re: [ceph-users] Blocked requests/ops?
Hello, On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote: Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. You should open a bug with that description and a way to reproduce things, even if only sometimes. Having slow disks instead of an overloaded network causing permanently blocked requests definitely shouldn't happen. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). There are some sleep values for recovery and scrub as well, these help a LOT with loaded clusters, too. Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Ah yes, I was going to comment on your HDDs earlier. As Dan van der Ster at CERN will happily admit, using green, slow HDDs with Ceph (and no SSD journals) is a bad idea. 
You're likely to see a VAST improvement with even just 1 journal SSD (of sufficient speed and durability) for 10 of your HDDs, a 1:5 ratio would of course be better. However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. BTW, if your monitors are just used for that function, 128GB is total and utter overkill. They will be fine with 16-32GB, your storage nodes will be much better served (pagecache for hot read objects) with more RAM. And with 20 OSDs per node 32GB is pretty close to the minimum I'd recommend anyway. Reducing I/O pressure caused by recovery and backfill undoubtedly helped improve cluster performance during recovery, that was expected. But we did not expect that recovery time stayed the same... The only explanation for this is that, during recovery, there are lots of operations that fail due to a timeout, are retried several times, etc. So if disks are the bottleneck, reducing such values may help as well in normal cluster operation (when propagating the replicas, for instance). And slow/blocked requests/ops do not occur (or at least, occur less frequently). Does this make sense to you? Any other thoughts? Very much so, see above for more thoughts. Christian Thank you very much again for your time. - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/
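The "sleep values for recovery and scrub" mentioned in this thread are presumably the osd recovery/scrub sleep options; a sketch of what such a ceph.conf fragment might look like. Treat the option names and the 0.1s values as placeholders to verify against your release's documentation, since availability and defaults changed across Ceph versions:

```
[osd]
# insert a small pause between recovery/scrub work items so client
# IO can interleave -- illustrative values only, not recommendations
osd recovery sleep = 0.1
osd scrub sleep = 0.1
```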
Re: [ceph-users] Blocked requests/ops?
On Wed, 27 May 2015 15:38:26 +0200 Xavier Serrano wrote: Hello, On Wed May 27 21:20:49 2015, Christian Balzer wrote: Hello, On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote: Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. You should open a bug with that description and a way to reproduce things, even if only sometimes. Having slow disks instead of an overloaded network causing permanently blocked requests definitely shouldn't happen. I totally agree. I'll try to reproduce and definitely open a bug. I'll let you know. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). There are some sleep values for recovery and scrub as well, these help a LOT with loaded clusters, too. Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Ah yes, I was going to comment on your HDDs earlier. 
As Dan van der Ster at CERN will happily admit, using green, slow HDDs with Ceph (and no SSD journals) is a bad idea. You're likely to see a VAST improvement with even just 1 journal SSD (of sufficient speed and durability) for 10 of your HDDs, a 1:5 ratio would of course be better. We do have SSDs, but we are not using them right now. We have 4 SSD per osd host (24 SSD at the moment). SSD model is Intel DC S3700 (400 GB). That's a nice one. ^^ We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). Definitely a good idea to test things out and get an idea what Ceph and your hardware can do. From my experience and reading this ML however I think your best bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 OSD HDDs. Currently cache-tiering is probably the worst use for those SSD resources, though the code and strategy is of course improving. Dedicated SSD pools may be a good fit depending on your use case. However I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than required by HDD OSDs or SSD journals/HDD OSDs systems. And you already have 20 OSDs in that box. What CPUs do you have in those storage nodes anyway? If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache-tiers and pure SSD pools and switch over things when you're comfortable with this and happy with the performance. However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? 
(disk, CPU, RAM, etc.). Tools like munin? Munin might work, I use collectd to gather all those values (and even more importantly all Ceph counters) and graphite to visualize it. For ad-hoc, on the spot analysis I really like atop (in a huge window), which will make it very clear what is going on. In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? I'm no expert for large (not even medium) clusters, so you'll have to research the archives and net (the CERN Ceph slide is nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point
Re: [ceph-users] Blocked requests/ops?
Hello, On Tue, 26 May 2015 10:00:13 -0600 Robert LeBlanc wrote: I've seen I/O become stuck after we have done network torture tests. It seems that after so many retries the OSD peering just gives up and doesn't retry any more. An OSD restart kicks off another round of retries and the I/O completes. It seems like there was some discussion about this on the devel list recently. While that sounds certainly plausible, the Ceph network of my cluster wasn't particularly busy or tortured at that time at all. I suppose other factors might cause a similar behavior, so a good way forward would probably be to ensure that retries will happen with no limitation and in a reasonable interval. As for Xavier, no I never filed a bug, that thread was all there is. Since I didn't have anything other to report than it happened and neither do you really, it is doubtful the devs can figure out what exactly caused it. So as I wrote above, probably best to make sure it keeps retrying no matter what. Christian - Robert LeBlanc GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, May 26, 2015 at 4:06 AM, Xavier Serrano wrote: Hello, Thanks for your detailed explanation, and for the pointer to the Unexplainable slow request thread. After investigating osd logs, disk SMART status, etc., the disk under osd.71 seems OK, so we restarted the osd... And voilà, problem seems to be solved! (or at least, the slow request message disappeared). But this really does not make me happy (and neither are you, Christian, I'm afraid). I understand that it is not acceptable that sometimes, apparently randomly, slow requests do happen and they remain stuck until an operator manually restarts the affected osd. My question now is: did you file a bug with the Ceph developers? What did they say? Could you provide me the links? I would like to reopen the issue if possible, and see if we can find a solution for this. 
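Before restarting a blamed OSD, it helps to extract mechanically which daemons `ceph health detail` is pointing at; a minimal, hypothetical helper (sample output taken from the health detail excerpt quoted in this thread):

```python
import re

def blocked_osds(health_detail):
    """Pull the OSD ids named in 'ops are blocked ... on osd.N' lines of
    `ceph health detail` output, so you know which daemons to inspect first."""
    return sorted({int(m) for m in
                   re.findall(r"blocked [\d.]+ sec on osd\.(\d+)", health_detail)})

sample = """HEALTH_WARN 1 requests are blocked 32 sec; 1 osds have slow requests
1 ops are blocked 67108.9 sec
1 ops are blocked 67108.9 sec on osd.71
1 osds have slow requests"""
print(blocked_osds(sample))  # -> [71]
```

From there, the admin socket (`ceph daemon osd.N dump_ops_in_flight` and `dump_historic_ops`) is the usual next step for gathering the internal counters and per-op timing mentioned in this thread, before resorting to a restart.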
About our cluster (testing, not production): - ceph version 0.94.1 - all hosts running Ubuntu 14.04 LTS 64-bits, kernel 3.16 - 5 monitors, 128GB RAM each - 6 osd hosts, 32GB RAM each, 20 osds per host, 1 HDD WD Green 2TB per osd - (and 6 more osds host to arrive soon) - 10 GbE interconnection Thank you very much indeed. Best regards, - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC On Tue May 26 14:19:22 2015, Christian Balzer wrote: Hello, Firstly, find my Unexplainable slow request thread in the ML archives and read all of it. On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote: Hello, We have observed that our cluster is often moving back and forth from HEALTH_OK to HEALTH_WARN states due to blocked requests. We have also observed blocked ops. For instance: As always SW versions and a detailed HW description (down to the model of HDDs used) will be helpful and educational. # ceph status cluster 905a1185-b4f0-4664-b881-f0ad2d8be964 health HEALTH_WARN 1 requests are blocked 32 sec monmap e5: 5 mons at {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0} election epoch 44, quorum 0,1,2,3,4 ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5 osdmap e5091: 120 osds: 100 up, 100 in pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects 13164 GB used, 168 TB / 181 TB avail 2048 active+clean client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s # ceph health detail HEALTH_WARN 1 requests are blocked 32 sec; 1 osds have slow requests 1 ops are blocked 67108.9 sec 1 ops are blocked 67108.9 sec on osd.71 1 osds have slow requests You will want to have a very close look at osd.71 (logs, internal counters, cranking up debugging), but might find it just as mysterious as my case in the thread mentioned above. My questions are: (1) Is it normal to have slow requests in a cluster? 
Not really, though the Ceph developers clearly think those just merits a WARNING level, whereas I would consider those a clear sign of brokenness, as VMs or other clients with those requests pending are likely to be unusable at that point. (2) Or is it a symptom that indicates that something is wrong? (for example, a disk is about to fail) That. Of course your cluster could be just at the edge of its performance and nothing but improving that (most likely by adding more nodes/OSDs) would fix that. (3) How can we fix the slow requests? Depends on cause of course. AFTER you exhausted all means and gotten all relevant log/performance data from osd.71 restarting the osd might be all that's needed. (4) What's the meaning of blocked ops, and how can they be
Re: [ceph-users] Blocked requests/ops?
Hello,

Thanks for your detailed explanation, and for the pointer to the "Unexplainable slow request" thread.

After investigating OSD logs, disk SMART status, etc., the disk under osd.71 seems OK, so we restarted the OSD... and voilà, the problem seems to be solved! (Or at least, the slow-request message disappeared.)

But this really does not make me happy (and neither are you, Christian, I'm afraid). I understand that it is not acceptable that sometimes, apparently at random, slow requests happen and remain stuck until an operator manually restarts the affected OSD.

My question now is: did you file a bug with the Ceph developers? What did they say? Could you give me the links? I would like to reopen the issue if possible and see if we can find a solution.

About our cluster (testing, not production):
- ceph version 0.94.1
- all hosts running Ubuntu 14.04 LTS 64-bit, kernel 3.16
- 5 monitors, 128 GB RAM each
- 6 OSD hosts, 32 GB RAM each, 20 OSDs per host, 1 WD Green 2 TB HDD per OSD
- (and 6 more OSD hosts to arrive soon)
- 10 GbE interconnect

Thank you very much indeed.
Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC

On Tue May 26 14:19:22 2015, Christian Balzer wrote:
[snip]
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests/ops?
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I've seen I/O become stuck after we have done network torture tests. It seems that after so many retries the OSD peering just gives up and doesn't retry any more. An OSD restart kicks off another round of retries and the I/O completes. It seems like there was some discussion about this on the devel list recently.

- Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, May 26, 2015 at 4:06 AM, Xavier Serrano wrote:
[snip quoted thread]
[ceph-users] Blocked requests/ops?
Hello,

We have observed that our cluster often moves back and forth between the HEALTH_OK and HEALTH_WARN states due to blocked requests. We have also observed blocked ops. For instance:

# ceph status
    cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e5: 5 mons at {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0}
            election epoch 44, quorum 0,1,2,3,4 ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5
     osdmap e5091: 120 osds: 100 up, 100 in
      pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects
            13164 GB used, 168 TB / 181 TB avail
                2048 active+clean
  client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s

# ceph health detail
HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
1 ops are blocked > 67108.9 sec
1 ops are blocked > 67108.9 sec on osd.71
1 osds have slow requests

My questions are:
(1) Is it normal to have slow requests in a cluster?
(2) Or is it a symptom that something is wrong (for example, a disk about to fail)?
(3) How can we fix the slow requests?
(4) What is the meaning of blocked ops, and how can they stay blocked so long? (67000 seconds is more than 18 hours!)
(5) How can we fix the blocked ops?

Thank you very much for your help.
Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC
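A quick sanity check of the "more than 18 hours" arithmetic in question (4), done by pulling the age out of the `ceph health detail` line (pure text processing, runnable anywhere):

```shell
# Convert the blocked-op age reported by `ceph health detail` into hours.
# The awk program finds the numeric field preceding "sec".
echo "1 ops are blocked > 67108.9 sec on osd.71" |
  awk '{ for (i = 2; i <= NF; i++) if ($i == "sec") printf "%.1f hours\n", $(i-1)/3600 }'
# prints: 18.6 hours
```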
Re: [ceph-users] Blocked requests/ops?
Hello,

Firstly, find my "Unexplainable slow request" thread in the ML archives and read all of it.

On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote:

> Hello,
> We have observed that our cluster often moves back and forth between the
> HEALTH_OK and HEALTH_WARN states due to blocked requests. We have also
> observed blocked ops. For instance:

As always, SW versions and a detailed HW description (down to the model of HDDs used) will be helpful and educational.

> # ceph status
> [snip status output]
> # ceph health detail
> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
> 1 ops are blocked > 67108.9 sec
> 1 ops are blocked > 67108.9 sec on osd.71
> 1 osds have slow requests

You will want to have a very close look at osd.71 (logs, internal counters, cranking up debugging), but you might find it just as mysterious as my case in the thread mentioned above.

> My questions are:
> (1) Is it normal to have slow requests in a cluster?

Not really. The Ceph developers clearly think these merit only a WARNING level, whereas I would consider them a clear sign of brokenness, as VMs or other clients with such requests pending are likely unusable at that point.

> (2) Or is it a symptom that indicates that something is wrong? (for
> example, a disk is about to fail)

That. Of course your cluster could simply be at the edge of its performance, and nothing but improving that (most likely by adding more nodes/OSDs) would fix it.

> (3) How can we fix the slow requests?

Depends on the cause, of course. AFTER you have exhausted all means and gotten all relevant log/performance data from osd.71, restarting the OSD might be all that's needed.

> (4) What's the meaning of blocked ops, and how can they be blocked so
> long? (67000 seconds is more than 18 hours!)

Precisely; this shouldn't happen.

> (5) How can we fix the blocked ops?

AFTER you have exhausted all means and gotten all relevant log/performance data from osd.71, restarting the OSD might be all that's needed.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
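Christian's "get all the relevant data from osd.71 first" step maps onto a handful of admin-socket commands. A sketch of that checklist (the script only prints the commands, since they must be run on the host where the OSD actually lives; command names are from the Hammer-era CLI):

```shell
# Print the inspection commands to run on osd.71's host before restarting it.
# dump_ops_in_flight / dump_historic_ops show current and recent slow ops;
# "perf dump" exposes the internal OSD counters Christian refers to.
osd=71
for cmd in \
  "ceph daemon osd.$osd dump_ops_in_flight" \
  "ceph daemon osd.$osd dump_historic_ops" \
  "ceph daemon osd.$osd perf dump"; do
  echo "$cmd"
done
```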
[ceph-users] blocked requests question
hello,

I am running a Ceph (RBD) cluster in a production environment hosting 200 VMs. Under normal circumstances Ceph's performance is quite good, but when I delete a snapshot or an image, the cluster shows a lot of blocked requests (generally more than 1000). The whole cluster then slows down and many VMs become very slow. Any ideas? Thank you.

The hardware of my cluster: 3 nodes, every node with 10 x 2 TB SATA disks and 1 x 120 GB SSD.
Re: [ceph-users] blocked requests question
Hello,

On Mon, 4 Aug 2014 11:03:37 +0800 飞 wrote:

> I am running a Ceph (RBD) cluster in a production environment hosting 200
> VMs. Under normal circumstances Ceph's performance is quite good, but when
> I delete a snapshot or an image, the cluster shows a lot of blocked
> requests (generally more than 1000). The whole cluster then slows down and
> many VMs become very slow. Any ideas? Thank you.
>
> The hardware of my cluster: 3 nodes, every node with 10 x 2 TB SATA disks
> and 1 x 120 GB SSD.

I suspect your cluster is pretty close to full capacity when operating normally and overwhelmed when something very intensive, like an image deletion (which has to touch every last object of the image), comes along. It would be nice if operations like these had (more and better) configuration options, as with scrub (load) and recovery operations.

Monitor your cluster with atop on all 3 nodes in parallel and observe the utilization of your HDDs and SSDs, CPU and network during a time of normal usage. Compare that to what you see when you delete an image (use a small one ^o^).

About your cluster: what OS, Ceph version, replication factor? What CPU, memory and network configuration? A single 120 GB SSD (which model?) as journal for 10 HDDs will definitely be the limiting factor when it comes to write speed, but hopefully it handles the IOPS well enough.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
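Some of the throttles Christian wishes for do exist as OSD options. A hypothetical ceph.conf fragment along those lines (option names as documented around the Firefly/Hammer releases; availability and sensible values vary by version, so verify each against your own release before use):

```ini
[osd]
; sleep between snapshot-trim operations, smoothing out snapshot deletes
osd_snap_trim_sleep = 0.1
; limit concurrent recovery/backfill work per OSD
osd_recovery_max_active = 1
osd_max_backfills = 1
; skip scheduled scrubs when the host load is above this threshold
osd_scrub_load_threshold = 0.5
```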
Re: [ceph-users] Blocked requests during and after CephFS delete
[ Re-added the list since I don't have log files. ;) ]

On Mon, Dec 9, 2013 at 5:52 AM, Oliver Schulz osch...@mpp.mpg.de wrote:

> Hi Greg,
>
> I'll send this privately, maybe better not to post log files etc. to the
> list. :-)
>
>> Nobody's reported it before, but I think the CephFS MDS is sending out
>> too many delete requests. [...] That's all speculation on my part though;
>> can you go sample the slow requests and see what their makeup looked
>> like? Do you have logs from the MDS or OSDs during that time period?
>
> Uh - how do I sample the requests?

I believe the slow requests should have been logged in the monitor's central log. That's a file sitting in the mon directory, and is probably accessible via other means I can't think of off-hand. Go see if it describes what the slow OSD requests are (e.g., are they a bunch of MDS deletes with some other stuff sprinkled in, all other stuff, or whatever).

> Concerning logs - you mean the regular ceph daemon log files? Sure - I'm
> attaching a tarball of all daemon logs from the relevant time interval
> (please don't publish them ;-) ). It's 13.2 MB, I hope it goes through by
> email. I also dumped ceph health every minute during the test.
>
> * 15:34:34 to 15:48:37 is the effect of my first mass delete. I aborted
>   that one before it could finish, to see if emperor would do better

By "aborted", do you mean you stopped deleting all the things you intended to?

[snip]

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
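The sampling Greg describes boils down to classifying the "slow request" lines in the monitor's central log by operation type. A sketch using made-up log lines (the osd_op payloads are invented for illustration; on a real cluster, point the grep at the ceph.log file in the mon directory instead of the sample file):

```shell
# Illustrative ceph.log excerpt: two MDS deletes and one client write.
cat > /tmp/ceph.log.sample <<'EOF'
2013-12-09 15:35:01 osd.12 [WRN] slow request 31.2 seconds old ... osd_op(mds.0.1:4242 [delete])
2013-12-09 15:35:03 osd.7 [WRN] slow request 33.0 seconds old ... osd_op(client.4112.0:99 [write 0~4194304])
2013-12-09 15:35:09 osd.12 [WRN] slow request 45.8 seconds old ... osd_op(mds.0.1:4250 [delete])
EOF
# Count slow requests per operation type ("[delete", "[write", ...)
grep 'slow request' /tmp/ceph.log.sample |
  grep -oE '\[[a-z]+' | sort | uniq -c | sort -rn
```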
[ceph-users] Blocked requests during and after CephFS delete
Hello Ceph-Gurus,

a short while ago I reported some trouble we had with our cluster suddenly going into a state of blocked requests. We did a few tests, and we can reproduce the problem: during / after deleting a substantial chunk of data on CephFS (a few TB), ceph health shows blocked requests like

HEALTH_WARN 222 requests are blocked > 32 sec

This goes on for a couple of minutes, during which the cluster is pretty much unusable. The number of blocked requests jumps around (but seems to go down on average), until finally (after about 15 minutes in my last test) health is back to OK. I upgraded the cluster to Ceph emperor (0.72.1) and repeated the test, but the problem persists.

Is this normal - and if not, what might be the reason? Obviously, having the cluster go on strike for a while after data deletion is a bit of a problem, especially with a mixed application load. The VMs running on RBDs aren't too happy about it, for example. ;-)

Our cluster structure: 6 nodes, 6 x 3 TB disks plus 1 system/journal SSD per node, one OSD per disk. We're running ceph version 0.72.1-1precise on Ubuntu 12.04.3 with kernel 3.8.0-33-generic (x86_64). All active pools use replication factor 3.

Any ideas?

Cheers, Oliver
Re: [ceph-users] Blocked requests during and after CephFS delete
On Sun, Dec 8, 2013 at 7:16 AM, Oliver Schulz osch...@mpp.mpg.de wrote:

> a short while ago I reported some trouble we had with our cluster suddenly
> going into a state of blocked requests. We did a few tests, and we can
> reproduce the problem: during / after deleting a substantial chunk of data
> on CephFS (a few TB), ceph health shows blocked requests like
> HEALTH_WARN 222 requests are blocked > 32 sec
> [snip]
> Is this normal - and if not, what might be the reason?

Nobody's reported it before, but I think the CephFS MDS is sending out too many delete requests. When you delete something in CephFS, it's just marked as deleted, and the MDS is supposed to do the actual deletion asynchronously in the background, but I'm not sure if there are any throttles on how quickly it does so. If you remove several terabytes worth of data, and the MDS is sending out RADOS object deletes for each 4 MB as fast as it can, that's a lot of unfiltered traffic on the OSDs.

That's all speculation on my part though; can you go sample the slow requests and see what their makeup looked like? Do you have logs from the MDS or OSDs during that time period?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
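Greg's speculation is easy to put numbers on: at CephFS's default 4 MB object size, a few TB of deleted file data turns into hundreds of thousands of RADOS deletes, each of which touches every replica. A back-of-envelope sketch (3 TB and 3 replicas are taken from Oliver's cluster description above):

```shell
# One RADOS object per 4 MB of file data; each delete touches every replica.
data_tb=3
object_mb=4
replicas=3
objects=$(( data_tb * 1024 * 1024 / object_mb ))
echo "objects to delete:     $objects"
echo "replica-level deletes: $(( objects * replicas ))"
# prints:
# objects to delete:     786432
# replica-level deletes: 2359296
```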