Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-11 Thread BASSAGET Cédric
Hello Robert,
I did not make any changes, so I'm still using the prio queue.
Regards
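
For reference, a quick way to confirm which queue an OSD is actually running is
to ask it over its admin socket on the OSD host (a sketch; osd.0 is just an
example ID and this assumes local admin-socket access):

  ceph daemon osd.0 config get osd_op_queue          # reports "prio" or "wpq"
  ceph daemon osd.0 config get osd_op_queue_cut_off  # reports "low" or "high"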

On Mon, Jun 10, 2019 at 5:44 PM, Robert LeBlanc wrote:

> I'm glad it's working. To be clear, did you use wpq, or is it still the
> prio queue?
>
> Sent from a mobile device, please excuse any typos.
>
> On Mon, Jun 10, 2019, 4:45 AM BASSAGET Cédric <
> cedric.bassaget...@gmail.com> wrote:
>
>> An update from 12.2.9 to 12.2.12 seems to have fixed the problem!
>>
>> On Mon, Jun 10, 2019 at 12:25 PM, BASSAGET Cédric <
>> cedric.bassaget...@gmail.com> wrote:
>>
>>> Hi Robert,
>>> Before doing anything on my prod env, I generate r/w load on the ceph
>>> cluster using fio.
>>> On my newest cluster, release 12.2.12, I did not manage to trigger the
>>> (REQUEST_SLOW) warning, even when my OSD disk usage goes above 95% (fio
>>> run from 4 different hosts).
>>>
>>> On my prod cluster, release 12.2.9, as soon as I run fio on a single
>>> host, I see a lot of REQUEST_SLOW warning messages, but "iostat -xd 1"
>>> does not show me more than 5-10% utilization on the disks...
>>>
>>> On Mon, Jun 10, 2019 at 10:12 AM, Robert LeBlanc wrote:
>>>
>>>> On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric <
>>>> cedric.bassaget...@gmail.com> wrote:
>>>>
>>>>> Hello Robert,
>>>>> My disks did not reach 100% on the last warning; they climbed to 70-80%
>>>>> usage. But I see the rrqm / wrqm counters increasing...
>>>>>
>>>>> Device:         rrqm/s   wrqm/s      r/s      w/s      rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>
>>>>> sda               0.00     4.00     0.00    16.00       0.00    104.00    13.00     0.00    0.00    0.00    0.00   0.00   0.00
>>>>> sdb               0.00     2.00     1.00  3456.00       8.00  25996.00    15.04     5.76    1.67    0.00    1.67   0.03   9.20
>>>>> sdd               4.00     0.00 41462.00  1119.00  331272.00   7996.00    15.94    19.89    0.47    0.48    0.21   0.02  66.00
>>>>>
>>>>> dm-0              0.00     0.00  6825.00   503.00  330856.00   7996.00    92.48     4.00    0.55    0.56    0.30   0.09  66.80
>>>>> dm-1              0.00     0.00     1.00  1129.00       8.00  25996.00    46.02     1.03    0.91    0.00    0.91   0.09  10.00
>>>>>
>>>>>
>>>>> sda is my system disk (SAMSUNG   MZILS480HEGR/007  GXL0), sdb and sdd
>>>>> are my OSDs
>>>>>
>>>>> would "osd op queue = wpq" help in this case ?
>>>>> Regards
>>>>>
>>>>
>>>> Your disk times look okay, just a lot more unbalanced than I would
>>>> expect. I'd give wpq a try; I use it all the time. Just be sure to also
>>>> include the "osd op queue cut off" setting, or it doesn't have much
>>>> effect. Let me know how it goes.
>>>> 
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>


Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-10 Thread BASSAGET Cédric
An update from 12.2.9 to 12.2.12 seems to have fixed the problem!
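
For completeness, a quick way to confirm that every daemon is actually running
the new release after such an upgrade (both should be available on Luminous):

  ceph versions            # counts of running daemons per version
  ceph tell osd.* version  # per-OSD check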

On Mon, Jun 10, 2019 at 12:25 PM, BASSAGET Cédric wrote:

> Hi Robert,
> Before doing anything on my prod env, I generate r/w load on the ceph
> cluster using fio.
> On my newest cluster, release 12.2.12, I did not manage to trigger the
> (REQUEST_SLOW) warning, even when my OSD disk usage goes above 95% (fio
> run from 4 different hosts).
>
> On my prod cluster, release 12.2.9, as soon as I run fio on a single host,
> I see a lot of REQUEST_SLOW warning messages, but "iostat -xd 1" does not
> show me more than 5-10% utilization on the disks...
>
> On Mon, Jun 10, 2019 at 10:12 AM, Robert LeBlanc wrote:
>
>> On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric <
>> cedric.bassaget...@gmail.com> wrote:
>>
>>> Hello Robert,
>>> My disks did not reach 100% on the last warning; they climbed to 70-80%
>>> usage. But I see the rrqm / wrqm counters increasing...
>>>
>>> Device:         rrqm/s   wrqm/s      r/s      w/s      rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>
>>> sda               0.00     4.00     0.00    16.00       0.00    104.00    13.00     0.00    0.00    0.00    0.00   0.00   0.00
>>> sdb               0.00     2.00     1.00  3456.00       8.00  25996.00    15.04     5.76    1.67    0.00    1.67   0.03   9.20
>>> sdd               4.00     0.00 41462.00  1119.00  331272.00   7996.00    15.94    19.89    0.47    0.48    0.21   0.02  66.00
>>>
>>> dm-0              0.00     0.00  6825.00   503.00  330856.00   7996.00    92.48     4.00    0.55    0.56    0.30   0.09  66.80
>>> dm-1              0.00     0.00     1.00  1129.00       8.00  25996.00    46.02     1.03    0.91    0.00    0.91   0.09  10.00
>>>
>>>
>>> sda is my system disk (SAMSUNG   MZILS480HEGR/007  GXL0), sdb and sdd
>>> are my OSDs
>>>
>>> would "osd op queue = wpq" help in this case ?
>>> Regards
>>>
>>
>> Your disk times look okay, just a lot more unbalanced than I would
>> expect. I'd give wpq a try; I use it all the time. Just be sure to also
>> include the "osd op queue cut off" setting, or it doesn't have much
>> effect. Let me know how it goes.
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>


Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-10 Thread BASSAGET Cédric
Hi Robert,
Before doing anything on my prod env, I generate r/w load on the ceph cluster
using fio.
On my newest cluster, release 12.2.12, I did not manage to trigger the
(REQUEST_SLOW) warning, even when my OSD disk usage goes above 95% (fio run
from 4 different hosts).

On my prod cluster, release 12.2.9, as soon as I run fio on a single host,
I see a lot of REQUEST_SLOW warning messages, but "iostat -xd 1" does not
show me more than 5-10% utilization on the disks...
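
The exact fio job isn't shown above, so here is a minimal sketch of the kind of
load generator meant, assuming a file-based random-write test against a mounted
RBD or CephFS path (/mnt/test, the size and runtime are placeholders to
adjust), run while watching the OSD disks with iostat:

  # on one or more client hosts
  fio --name=cephtest --directory=/mnt/test --rw=randwrite --bs=4k \
      --iodepth=32 --numjobs=4 --size=10G --direct=1 \
      --time_based --runtime=300 --group_reporting

  # on each OSD host, in parallel
  iostat -xd 1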

On Mon, Jun 10, 2019 at 10:12 AM, Robert LeBlanc wrote:

> On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric <
> cedric.bassaget...@gmail.com> wrote:
>
>> Hello Robert,
>> My disks did not reach 100% on the last warning; they climbed to 70-80%
>> usage. But I see the rrqm / wrqm counters increasing...
>>
>> Device:         rrqm/s   wrqm/s      r/s      w/s      rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>
>> sda               0.00     4.00     0.00    16.00       0.00    104.00    13.00     0.00    0.00    0.00    0.00   0.00   0.00
>> sdb               0.00     2.00     1.00  3456.00       8.00  25996.00    15.04     5.76    1.67    0.00    1.67   0.03   9.20
>> sdd               4.00     0.00 41462.00  1119.00  331272.00   7996.00    15.94    19.89    0.47    0.48    0.21   0.02  66.00
>>
>> dm-0              0.00     0.00  6825.00   503.00  330856.00   7996.00    92.48     4.00    0.55    0.56    0.30   0.09  66.80
>> dm-1              0.00     0.00     1.00  1129.00       8.00  25996.00    46.02     1.03    0.91    0.00    0.91   0.09  10.00
>>
>>
>> sda is my system disk (SAMSUNG   MZILS480HEGR/007  GXL0), sdb and sdd are
>> my OSDs
>>
>> would "osd op queue = wpq" help in this case ?
>> Regards
>>
>
> Your disk times look okay, just a lot more unbalanced than I would expect.
> I'd give wpq a try; I use it all the time. Just be sure to also include the
> "osd op queue cut off" setting, or it doesn't have much effect. Let me know
> how it goes.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>


Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-10 Thread BASSAGET Cédric
Hello Robert,
My disks did not reach 100% on the last warning; they climbed to 70-80%
usage. But I see the rrqm / wrqm counters increasing...

Device:         rrqm/s   wrqm/s      r/s      w/s      rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

sda               0.00     4.00     0.00    16.00       0.00    104.00    13.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     2.00     1.00  3456.00       8.00  25996.00    15.04     5.76    1.67    0.00    1.67   0.03   9.20
sdd               4.00     0.00 41462.00  1119.00  331272.00   7996.00    15.94    19.89    0.47    0.48    0.21   0.02  66.00

dm-0              0.00     0.00  6825.00   503.00  330856.00   7996.00    92.48     4.00    0.55    0.56    0.30   0.09  66.80
dm-1              0.00     0.00     1.00  1129.00       8.00  25996.00    46.02     1.03    0.91    0.00    0.91   0.09  10.00


sda is my system disk (SAMSUNG   MZILS480HEGR/007  GXL0), sdb and sdd are
my OSDs

would "osd op queue = wpq" help in this case ?
Regards
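
For reference, applying the change Robert suggests in his reply quoted below
would look roughly like this (a sketch; it assumes systemd-managed OSDs, osd.0
is only an example ID, and osd_op_queue only takes effect after an OSD
restart):

  # in /etc/ceph/ceph.conf, under [osd]:
  #   osd op queue = wpq
  #   osd op queue cut off = high

  systemctl restart ceph-osd@0                       # restart OSDs one at a time
  ceph daemon osd.0 config get osd_op_queue          # verify the running value
  ceph daemon osd.0 config get osd_op_queue_cut_off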

On Sat, Jun 8, 2019 at 7:44 AM, Robert LeBlanc wrote:

> With the low number of OSDs, you are probably saturating the disks. Check
> with `iostat -xd 2` and see what the utilization of your disks is. A lot of
> SSDs don't perform well with Ceph's heavy sync writes, and performance is
> terrible.
>
> If some of your drives are at 100% utilization while others are lower, you
> can possibly get more performance and greatly reduce the blocked I/O with
> the WPQ scheduler. In ceph.conf, add this to the [osd] section and restart
> the OSD processes:
>
> osd op queue = wpq
> osd op queue cut off = high
>
> This has helped our clusters with fairness between OSDs and making
> backfills not so disruptive.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Jun 6, 2019 at 1:43 AM BASSAGET Cédric <
> cedric.bassaget...@gmail.com> wrote:
>
>> Hello,
>>
>> I see messages related to REQUEST_SLOW a few times per day.
>>
>> here's my ceph -s  :
>>
>> root@ceph-pa2-1:/etc/ceph# ceph -s
>>   cluster:
>> id: 72d94815-f057-4127-8914-448dfd25f5bc
>> health: HEALTH_OK
>>
>>   services:
>> mon: 3 daemons, quorum ceph-pa2-1,ceph-pa2-2,ceph-pa2-3
>> mgr: ceph-pa2-3(active), standbys: ceph-pa2-1, ceph-pa2-2
>> osd: 6 osds: 6 up, 6 in
>>
>>   data:
>> pools:   1 pools, 256 pgs
>> objects: 408.79k objects, 1.49TiB
>> usage:   4.44TiB used, 37.5TiB / 41.9TiB avail
>> pgs: 256 active+clean
>>
>>   io:
>> client:   8.00KiB/s rd, 17.2MiB/s wr, 1op/s rd, 546op/s wr
>>
>>
>> Running ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
>> luminous (stable)
>>
>> I've checked:
>> - all my network stack : OK ( 2*10G LAG )
>> - memory usage : OK (256G on each host, about 2% used per osd)
>> - cpu usage : OK (Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz)
>> - disk status : OK (SAMSUNG   AREA7680S5xnNTRI  3P04 => samsung DC series)
>>
>> I heard on IRC that it can be related to the Samsung PM / SM series.
>>
>> Is anybody here facing the same problem? What can I do to solve it?
>> Regards,
>> Cédric
>>
>


[ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-06 Thread BASSAGET Cédric
Hello,

I see messages related to REQUEST_SLOW a few times per day.

here's my ceph -s  :

root@ceph-pa2-1:/etc/ceph# ceph -s
  cluster:
id: 72d94815-f057-4127-8914-448dfd25f5bc
health: HEALTH_OK

  services:
mon: 3 daemons, quorum ceph-pa2-1,ceph-pa2-2,ceph-pa2-3
mgr: ceph-pa2-3(active), standbys: ceph-pa2-1, ceph-pa2-2
osd: 6 osds: 6 up, 6 in

  data:
pools:   1 pools, 256 pgs
objects: 408.79k objects, 1.49TiB
usage:   4.44TiB used, 37.5TiB / 41.9TiB avail
pgs: 256 active+clean

  io:
client:   8.00KiB/s rd, 17.2MiB/s wr, 1op/s rd, 546op/s wr


Running ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
luminous (stable)

I've checked:
- all my network stack : OK ( 2*10G LAG )
- memory usage : OK (256G on each host, about 2% used per osd)
- cpu usage : OK (Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz)
- disk status : OK (SAMSUNG   AREA7680S5xnNTRI  3P04 => samsung DC series)

I heard on IRC that it can be related to the Samsung PM / SM series.
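
For what it's worth, the blocked requests themselves can be inspected on the
implicated OSDs with the standard admin-socket commands (osd.0 is only an
example; substitute one of the OSD IDs named in the warning):

  ceph health detail                    # shows which OSDs have slow/blocked requests
  ceph daemon osd.0 dump_ops_in_flight  # ops currently queued on that OSD
  ceph daemon osd.0 dump_historic_ops   # recent slow ops with per-stage timings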

Is anybody here facing the same problem? What can I do to solve it?
Regards,
Cédric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com