Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
Hello Robert,
I did not make any changes, so I'm still using the prio queue.
Regards

On Mon, Jun 10, 2019 at 17:44, Robert LeBlanc wrote:
> I'm glad it's working. To be clear, did you use wpq, or is it still the
> prio queue?
>
> Sent from a mobile device, please excuse any typos.
>
> [earlier quoted messages trimmed; they appear in full below]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
An update from 12.2.9 to 12.2.12 seems to have fixed the problem!

On Mon, Jun 10, 2019 at 12:25, BASSAGET Cédric wrote:
> Hi Robert,
> Before doing anything on my prod env, I generate r/w on the ceph cluster
> using fio.
> [rest of the quoted thread trimmed; it appears in full below]
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
Hi Robert,
Before doing anything on my prod env, I generated r/w load on the ceph cluster using fio.

On my newest cluster, release 12.2.12, I did not manage to trigger the (REQUEST_SLOW) warning, even when my OSD disk usage goes above 95% (fio ran from 4 different hosts).

On my prod cluster, release 12.2.9, as soon as I run fio on a single host, I see a lot of REQUEST_SLOW warning messages, but "iostat -xd 1" does not show me a usage of more than 5-10% on the disks...

On Mon, Jun 10, 2019 at 10:12, Robert LeBlanc wrote:
> On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric <
> cedric.bassaget...@gmail.com> wrote:
> [quoted iostat message trimmed; it appears in full below]
>
> Your disk times look okay, just a lot more unbalanced than I would
> expect. I'd give wpq a try, I use it all the time; just be sure to also
> include the op cut off setting, or it doesn't have much effect. Let me
> know how it goes.
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
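For readers who want to reproduce this kind of test, a load similar to the one described can be generated with an fio job file along these lines. This is only a sketch: the target path, block size, and queue depth are assumptions, not the exact parameters used in this thread.

```ini
; hypothetical fio job approximating the random-write load described above
[global]
ioengine=libaio
direct=1
runtime=60
time_based
group_reporting

[osd-write-test]
; replace with a scratch device or file -- writing to a raw device destroys data!
filename=/tmp/fio-testfile
size=1G
rw=randwrite
bs=4k
iodepth=32
numjobs=4
```

Running the same job from several hosts at once, as described above, multiplies the aggregate load on the OSDs.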
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
Hello Robert,
My disks did not reach 100% on the last warning; they climb to 70-80% usage. But I see the rrqm / wrqm counters increasing...

Device: rrqm/s wrqm/s      r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda       0.00   4.00     0.00   16.00      0.00   104.00    13.00     0.00   0.00    0.00    0.00   0.00   0.00
sdb       0.00   2.00     1.00 3456.00      8.00 25996.00    15.04     5.76   1.67    0.00    1.67   0.03   9.20
sdd       4.00   0.00 41462.00 1119.00 331272.00  7996.00    15.94    19.89   0.47    0.48    0.21   0.02  66.00
dm-0      0.00   0.00  6825.00  503.00 330856.00  7996.00    92.48     4.00   0.55    0.56    0.30   0.09  66.80
dm-1      0.00   0.00     1.00 1129.00      8.00 25996.00    46.02     1.03   0.91    0.00    0.91   0.09  10.00

sda is my system disk (SAMSUNG MZILS480HEGR/007 GXL0); sdb and sdd are my OSDs.

Would "osd op queue = wpq" help in this case?
Regards

On Sat, Jun 8, 2019 at 07:44, Robert LeBlanc wrote:
> With the low number of OSDs, you are probably saturating the disks. Check
> with `iostat -xd 2` and see what the utilization of your disks is. A lot
> of SSDs don't perform well with Ceph's heavy sync writes, and performance
> is terrible.
>
> If some of your drives are at 100% while others are at lower utilization,
> you can possibly get more performance and greatly reduce the blocked I/O
> with the WPQ scheduler. In ceph.conf, add this to the [osd] section and
> restart the processes:
>
> osd op queue = wpq
> osd op queue cut off = high
>
> This has helped our clusters with fairness between OSDs and making
> backfills not so disruptive.
>
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> [quoted original report trimmed; it appears in full below]
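Robert's suggestion above, written out as a ceph.conf fragment (a sketch; both options go together, and the OSD daemons must be restarted to pick them up):

```ini
[osd]
# Weighted priority queue scheduler; the "cut off = high" setting is
# required for wpq to have much effect, as noted above.
osd op queue = wpq
osd op queue cut off = high
```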
[ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
Hello,

I see messages related to REQUEST_SLOW a few times per day.

Here's my ceph -s:

root@ceph-pa2-1:/etc/ceph# ceph -s
  cluster:
    id:     72d94815-f057-4127-8914-448dfd25f5bc
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-pa2-1,ceph-pa2-2,ceph-pa2-3
    mgr: ceph-pa2-3(active), standbys: ceph-pa2-1, ceph-pa2-2
    osd: 6 osds: 6 up, 6 in

  data:
    pools:   1 pools, 256 pgs
    objects: 408.79k objects, 1.49TiB
    usage:   4.44TiB used, 37.5TiB / 41.9TiB avail
    pgs:     256 active+clean

  io:
    client: 8.00KiB/s rd, 17.2MiB/s wr, 1op/s rd, 546op/s wr

Running ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)

I've checked:
- all my network stack: OK (2*10G LAG)
- memory usage: OK (256G on each host, about 2% used per OSD)
- cpu usage: OK (Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz)
- disk status: OK (SAMSUNG AREA7680S5xnNTRI 3P04 => Samsung DC series)

I heard on IRC that it can be related to the Samsung PM / SM series.

Is anybody here facing the same problem? What can I do to solve it?
Regards,
Cédric
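When REQUEST_SLOW appears, it can help to see what the implicated OSDs were actually doing: `ceph daemon osd.<id> dump_historic_ops`, run on the OSD's host, returns recent ops as JSON. A small sketch like the one below pulls out the ops blocked longer than the 32 s threshold. The embedded JSON is a hypothetical, heavily trimmed sample of that dump's shape, not data from this cluster.

```python
import json

# Hypothetical, trimmed sample of `ceph daemon osd.N dump_historic_ops` output.
sample = '''
{
  "size": 20,
  "duration": 600,
  "ops": [
    {"description": "osd_op(client.4123.0:1 1.2f ... [write 0~4096])",
     "initiated_at": "2019-06-06 09:00:01.000000", "duration": 34.2},
    {"description": "osd_op(client.4123.0:2 1.13 ... [read 0~4096])",
     "initiated_at": "2019-06-06 09:00:02.000000", "duration": 0.8}
  ]
}
'''

def slowest_ops(dump_json, threshold=32.0):
    """Return (duration, description) pairs for ops slower than `threshold`
    seconds, slowest first."""
    dump = json.loads(dump_json)
    slow = [(op["duration"], op["description"])
            for op in dump["ops"] if op["duration"] > threshold]
    return sorted(slow, reverse=True)

for dur, desc in slowest_ops(sample):
    print(f"{dur:6.1f}s  {desc}")
```

The descriptions of the slow ops show whether the blocked requests are client writes, reads, or internal operations, which narrows down whether the bottleneck is the disk, the network, or the op queue.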