Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Robert LeBlanc
On Tue, Sep 24, 2019 at 4:33 AM Thomas <74cmo...@gmail.com> wrote:
>
> Hi,
>
> I'm experiencing the same issue with this setting in ceph.conf:
> osd op queue = wpq
> osd op queue cut off = high
>
> Furthermore I cannot read any old data in the relevant pool that is
> serving CephFS.
> However, I can write new data and read this new data.

If you restarted all the OSDs with this setting, it won't necessarily
prevent all blocked IO; it mainly helps prevent the very long blocked IO
and makes sure that IO eventually completes in a fairer manner.
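
Once the OSDs are back up, one way to confirm the values actually in
effect (just a sketch; osd.0 is an example, and the command has to be run
on the host that owns that OSD) is the admin socket:

    # ask a running OSD which op queue settings it is using
    ceph daemon osd.0 config get osd_op_queue
    ceph daemon osd.0 config get osd_op_queue_cut_off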

It sounds like you may have some MDS issues that go deeper than my
understanding. The first thing I'd try is to bounce the MDS service.
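
For reference, on a systemd-based install bouncing an MDS usually just
means restarting its unit on the node that runs it. The instance name
below (icadmin006, taken from the status output elsewhere in this thread)
is only illustrative, and the unit name can differ with a non-default
cluster name:

    # restart one MDS daemon on the node that hosts it
    systemctl restart ceph-mds@icadmin006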

> > If I want to add this to my ceph-ansible playbook parameters, in which file
> > should I add it, and what is the best way to do it?
> >
> > Should I add those 3 lines to all.yml or to osds.yml?
> >
> > ceph_conf_overrides:
> >   global:
> >     osd_op_queue_cut_off: high
> >
> > Is there another (better?) way to do that?

I can't speak to either of those approaches. I wanted all my config in
a single file, so I put it in my inventory file, but it looks like you
have the right idea.
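
For what it's worth, a minimal sketch of carrying that override through
ceph-ansible's ceph_conf_overrides (the file name and placement here are
only an example; adjust them to your own layout):

    # group_vars/all.yml (or wherever your playbook keeps its vars)
    ceph_conf_overrides:
      global:
        osd_op_queue: wpq
        osd_op_queue_cut_off: high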


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Thomas
Hi,

I'm experiencing the same issue with this setting in ceph.conf:
    osd op queue = wpq
    osd op queue cut off = high

Furthermore I cannot read any old data in the relevant pool that is
serving CephFS.
However, I can write new data and read this new data.

Regards
Thomas

On 24.09.2019 at 10:24, Yoann Moulin wrote:
> Hello,
>
>>> I have a Ceph Nautilus cluster 14.2.1, for CephFS only, on 40x 1.8T SAS disks
>>> (no SSD) across 20 servers.
>>>
>>> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow 
>>> requests, 0 included below; oldest blocked for > 60281.199503 secs"
>>>
>>> After a few investigations, I saw that ALL ceph-osd processes eat a lot of
>>> memory, up to 130GB RSS each. Is this value normal? Could this be related to
>>> the slow requests? Does running on disks only (no SSD) increase the
>>> probability of slow requests?
>> If you haven't set:
>>
>> osd op queue cut off = high
>>
>> in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should
>> help quite a bit with pure HDD clusters.
> OK I'll try this, thanks.
>
> If I want to add this to my ceph-ansible playbook parameters, in which file
> should I add it, and what is the best way to do it?
>
> Should I add those 3 lines to all.yml or to osds.yml?
>
> ceph_conf_overrides:
>   global:
>     osd_op_queue_cut_off: high
>
> Is there another (better?) way to do that?
>
> Thanks for your help.
>
> Best regards,
>



Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Yoann Moulin
Hello,

>> I have a Ceph Nautilus cluster 14.2.1, for CephFS only, on 40x 1.8T SAS disks
>> (no SSD) across 20 servers.
>>
>> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow 
>> requests, 0 included below; oldest blocked for > 60281.199503 secs"
>>
>> After a few investigations, I saw that ALL ceph-osd processes eat a lot of
>> memory, up to 130GB RSS each. Is this value normal? Could this be related to
>> the slow requests? Does running on disks only (no SSD) increase the
>> probability of slow requests?
>
> If you haven't set:
> 
> osd op queue cut off = high
> 
> in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should
> help quite a bit with pure HDD clusters.

OK I'll try this, thanks.

If I want to add this to my ceph-ansible playbook parameters, in which file
should I add it, and what is the best way to do it?

Should I add those 3 lines to all.yml or to osds.yml?

ceph_conf_overrides:
  global:
    osd_op_queue_cut_off: high

Is there another (better?) way to do that?
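
For example, would the monitors' centralized config store work instead? A
rough sketch of what I mean (assuming the OSDs would still need a restart
for these particular options to take effect):

    # store the override cluster-wide; picked up when each OSD restarts
    ceph config set osd osd_op_queue wpq
    ceph config set osd osd_op_queue_cut_off high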

Thanks for your help.

Best regards,

-- 
Yoann Moulin
EPFL IC-IT


Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-23 Thread Robert LeBlanc
On Thu, Sep 19, 2019 at 2:36 AM Yoann Moulin  wrote:
>
> Hello,
>
> I have a Ceph Nautilus cluster 14.2.1, for CephFS only, on 40x 1.8T SAS disks
> (no SSD) across 20 servers.
>
> >   cluster:
> > id: 778234df-5784-4021-b983-0ee1814891be
> > health: HEALTH_WARN
> > 2 MDSs report slow requests
> >
> >   services:
> > mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 5d)
> > mgr: icadmin008(active, since 18h), standbys: icadmin007, icadmin006
> > mds: cephfs:3 
> > {0=icadmin006=up:active,1=icadmin007=up:active,2=icadmin008=up:active}
> > osd: 40 osds: 40 up (since 2w), 40 in (since 3w)
> >
> >   data:
> > pools:   3 pools, 672 pgs
> > objects: 36.08M objects, 19 TiB
> > usage:   51 TiB used, 15 TiB / 65 TiB avail
> > pgs: 670 active+clean
> >  2   active+clean+scrubbing
>
> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow requests, 
> 0 included below; oldest blocked for > 60281.199503 secs"
>
> > HEALTH_WARN 2 MDSs report slow requests
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> > mdsicadmin007(mds.1): 3 slow requests are blocked > 30 secs
> > mdsicadmin006(mds.0): 10 slow requests are blocked > 30 secs
>
> After a few investigations, I saw that ALL ceph-osd processes eat a lot of
> memory, up to 130GB RSS each. Is this value normal? Could this be related to
> the slow requests? Does running on disks only (no SSD) increase the
> probability of slow requests?
>
> > USER PID %CPU %MEM   VSZ   RSS TTY STAT STAR   TIME COMMAND
> > ceph   34196  3.6 35.0 156247524 138521572 ? Ssl  Jul01 4173:18 
> > /usr/bin/ceph-osd -f --cluster apollo --id 1 --setuser ceph --setgroup ceph
> > ceph   34394  3.6 35.0 160001436 138487776 ? Ssl  Jul01 4178:37 
> > /usr/bin/ceph-osd -f --cluster apollo --id 32 --setuser ceph --setgroup ceph
> > ceph   34709  3.5 35.1 156369636 138752044 ? Ssl  Jul01 4088:57 
> > /usr/bin/ceph-osd -f --cluster apollo --id 29 --setuser ceph --setgroup ceph
> > ceph   34915  3.4 35.1 158976936 138715900 ? Ssl  Jul01 3950:45 
> > /usr/bin/ceph-osd -f --cluster apollo --id 3 --setuser ceph --setgroup ceph
> > ceph   34156  3.4 35.1 158280768 138714484 ? Ssl  Jul01 3984:11 
> > /usr/bin/ceph-osd -f --cluster apollo --id 30 --setuser ceph --setgroup ceph
> > ceph   34378  3.7 35.1 155162420 138708096 ? Ssl  Jul01 4312:12 
> > /usr/bin/ceph-osd -f --cluster apollo --id 8 --setuser ceph --setgroup ceph
> > ceph   34161  3.5 35.0 159606788 138523652 ? Ssl  Jul01 4128:17 
> > /usr/bin/ceph-osd -f --cluster apollo --id 16 --setuser ceph --setgroup ceph
> > ceph   34380  3.6 35.1 161465372 138670168 ? Ssl  Jul01 4238:20 
> > /usr/bin/ceph-osd -f --cluster apollo --id 35 --setuser ceph --setgroup ceph
> > ceph   33822  3.7 35.1 163456644 138734036 ? Ssl  Jul01 4342:05 
> > /usr/bin/ceph-osd -f --cluster apollo --id 15 --setuser ceph --setgroup ceph
> > ceph   34003  3.8 35.0 161868584 138531208 ? Ssl  Jul01 4427:32 
> > /usr/bin/ceph-osd -f --cluster apollo --id 38 --setuser ceph --setgroup ceph
> > ceph9753  2.8 24.2 96923856 95580776 ?   Ssl  Sep02 700:25 
> > /usr/bin/ceph-osd -f --cluster apollo --id 31 --setuser ceph --setgroup ceph
> > ceph   10120  2.5 24.0 96130340 94856244 ?   Ssl  Sep02 644:50 
> > /usr/bin/ceph-osd -f --cluster apollo --id 7 --setuser ceph --setgroup ceph
> > ceph   36204  3.6 35.0 159394476 138592124 ? Ssl  Jul01 4185:36 
> > /usr/bin/ceph-osd -f --cluster apollo --id 18 --setuser ceph --setgroup ceph
> > ceph   36427  3.7 34.4 155699060 136076432 ? Ssl  Jul01 4298:26 
> > /usr/bin/ceph-osd -f --cluster apollo --id 36 --setuser ceph --setgroup ceph
> > ceph   36622  4.1 35.1 158219408 138724688 ? Ssl  Jul01 4779:14 
> > /usr/bin/ceph-osd -f --cluster apollo --id 19 --setuser ceph --setgroup ceph
> > ceph   36881  4.0 35.1 157748752 138719064 ? Ssl  Jul01 4669:54 
> > /usr/bin/ceph-osd -f --cluster apollo --id 37 --setuser ceph --setgroup ceph
> > ceph   34649  3.7 35.1 159601580 138652012 ? Ssl  Jul01 4337:20 
> > /usr/bin/ceph-osd -f --cluster apollo --id 14 --setuser ceph --setgroup ceph
> > ceph   34881  3.8 35.1 158632412 138764376 ? Ssl  Jul01 4433:50 
> > /usr/bin/ceph-osd -f --cluster apollo --id 33 --setuser ceph --setgroup ceph
> > ceph   34646  4.2 35.1 155029328 138732376 ? Ssl  Jul01 4831:24 
> > /usr/bin/ceph-osd -f --cluster apollo --id 17 --setuser ceph --setgroup ceph
> > ceph   34881  4.1 35.1 156801676 138763588 ? Ssl  Jul01 4710:19 
> > /usr/bin/ceph-osd -f --cluster apollo --id 39 --setuser ceph --setgroup ceph
> > ceph   36766  3.7 35.1 158070740 138703240 ? Ssl  Jul01 4341:42 
> > /usr/bin/ceph-osd -f --cluster apollo --id 13 --setuser ceph --setgroup ceph
> > ceph   37013  3.5 35.0 157767668 138272248 ? Ssl  Jul01 4094:12 
> > /usr/bin/ceph-osd -f --cluster apollo --id 34 --setuser ceph --setgroup ceph
> > ceph   35007  3.4 35.1 160318780 138756404 ? Ssl  Jul01 3963:21 

[ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-19 Thread Yoann Moulin
Hello,

I have a Ceph Nautilus cluster 14.2.1, for CephFS only, on 40x 1.8T SAS disks
(no SSD) across 20 servers.

>   cluster:
> id: 778234df-5784-4021-b983-0ee1814891be
> health: HEALTH_WARN
> 2 MDSs report slow requests
>  
>   services:
> mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 5d)
> mgr: icadmin008(active, since 18h), standbys: icadmin007, icadmin006
> mds: cephfs:3 
> {0=icadmin006=up:active,1=icadmin007=up:active,2=icadmin008=up:active}
> osd: 40 osds: 40 up (since 2w), 40 in (since 3w)
>  
>   data:
> pools:   3 pools, 672 pgs
> objects: 36.08M objects, 19 TiB
> usage:   51 TiB used, 15 TiB / 65 TiB avail
> pgs: 670 active+clean
>  2   active+clean+scrubbing

I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow requests, 0 
included below; oldest blocked for > 60281.199503 secs"

> HEALTH_WARN 2 MDSs report slow requests
> MDS_SLOW_REQUEST 2 MDSs report slow requests
> mdsicadmin007(mds.1): 3 slow requests are blocked > 30 secs
> mdsicadmin006(mds.0): 10 slow requests are blocked > 30 secs

After a few investigations, I saw that ALL ceph-osd processes eat a lot of
memory, up to 130GB RSS each. Is this value normal? Could this be related to
the slow requests? Does running on disks only (no SSD) increase the
probability of slow requests?

> USER PID %CPU %MEM   VSZ   RSS TTY STAT STAR   TIME COMMAND
> ceph   34196  3.6 35.0 156247524 138521572 ? Ssl  Jul01 4173:18 
> /usr/bin/ceph-osd -f --cluster apollo --id 1 --setuser ceph --setgroup ceph
> ceph   34394  3.6 35.0 160001436 138487776 ? Ssl  Jul01 4178:37 
> /usr/bin/ceph-osd -f --cluster apollo --id 32 --setuser ceph --setgroup ceph
> ceph   34709  3.5 35.1 156369636 138752044 ? Ssl  Jul01 4088:57 
> /usr/bin/ceph-osd -f --cluster apollo --id 29 --setuser ceph --setgroup ceph
> ceph   34915  3.4 35.1 158976936 138715900 ? Ssl  Jul01 3950:45 
> /usr/bin/ceph-osd -f --cluster apollo --id 3 --setuser ceph --setgroup ceph
> ceph   34156  3.4 35.1 158280768 138714484 ? Ssl  Jul01 3984:11 
> /usr/bin/ceph-osd -f --cluster apollo --id 30 --setuser ceph --setgroup ceph
> ceph   34378  3.7 35.1 155162420 138708096 ? Ssl  Jul01 4312:12 
> /usr/bin/ceph-osd -f --cluster apollo --id 8 --setuser ceph --setgroup ceph
> ceph   34161  3.5 35.0 159606788 138523652 ? Ssl  Jul01 4128:17 
> /usr/bin/ceph-osd -f --cluster apollo --id 16 --setuser ceph --setgroup ceph
> ceph   34380  3.6 35.1 161465372 138670168 ? Ssl  Jul01 4238:20 
> /usr/bin/ceph-osd -f --cluster apollo --id 35 --setuser ceph --setgroup ceph
> ceph   33822  3.7 35.1 163456644 138734036 ? Ssl  Jul01 4342:05 
> /usr/bin/ceph-osd -f --cluster apollo --id 15 --setuser ceph --setgroup ceph
> ceph   34003  3.8 35.0 161868584 138531208 ? Ssl  Jul01 4427:32 
> /usr/bin/ceph-osd -f --cluster apollo --id 38 --setuser ceph --setgroup ceph
> ceph9753  2.8 24.2 96923856 95580776 ?   Ssl  Sep02 700:25 
> /usr/bin/ceph-osd -f --cluster apollo --id 31 --setuser ceph --setgroup ceph
> ceph   10120  2.5 24.0 96130340 94856244 ?   Ssl  Sep02 644:50 
> /usr/bin/ceph-osd -f --cluster apollo --id 7 --setuser ceph --setgroup ceph
> ceph   36204  3.6 35.0 159394476 138592124 ? Ssl  Jul01 4185:36 
> /usr/bin/ceph-osd -f --cluster apollo --id 18 --setuser ceph --setgroup ceph
> ceph   36427  3.7 34.4 155699060 136076432 ? Ssl  Jul01 4298:26 
> /usr/bin/ceph-osd -f --cluster apollo --id 36 --setuser ceph --setgroup ceph
> ceph   36622  4.1 35.1 158219408 138724688 ? Ssl  Jul01 4779:14 
> /usr/bin/ceph-osd -f --cluster apollo --id 19 --setuser ceph --setgroup ceph
> ceph   36881  4.0 35.1 157748752 138719064 ? Ssl  Jul01 4669:54 
> /usr/bin/ceph-osd -f --cluster apollo --id 37 --setuser ceph --setgroup ceph
> ceph   34649  3.7 35.1 159601580 138652012 ? Ssl  Jul01 4337:20 
> /usr/bin/ceph-osd -f --cluster apollo --id 14 --setuser ceph --setgroup ceph
> ceph   34881  3.8 35.1 158632412 138764376 ? Ssl  Jul01 4433:50 
> /usr/bin/ceph-osd -f --cluster apollo --id 33 --setuser ceph --setgroup ceph
> ceph   34646  4.2 35.1 155029328 138732376 ? Ssl  Jul01 4831:24 
> /usr/bin/ceph-osd -f --cluster apollo --id 17 --setuser ceph --setgroup ceph
> ceph   34881  4.1 35.1 156801676 138763588 ? Ssl  Jul01 4710:19 
> /usr/bin/ceph-osd -f --cluster apollo --id 39 --setuser ceph --setgroup ceph
> ceph   36766  3.7 35.1 158070740 138703240 ? Ssl  Jul01 4341:42 
> /usr/bin/ceph-osd -f --cluster apollo --id 13 --setuser ceph --setgroup ceph
> ceph   37013  3.5 35.0 157767668 138272248 ? Ssl  Jul01 4094:12 
> /usr/bin/ceph-osd -f --cluster apollo --id 34 --setuser ceph --setgroup ceph
> ceph   35007  3.4 35.1 160318780 138756404 ? Ssl  Jul01 3963:21 
> /usr/bin/ceph-osd -f --cluster apollo --id 2 --setuser ceph --setgroup ceph
> ceph   35217  3.5 35.1 159023744 138626680 ? Ssl  Jul01 4041:50 
> /usr/bin/ceph-osd -f --cluster apollo --id 22 --setuser