On Thu, Sep 19, 2019 at 2:36 AM Yoann Moulin <yoann.mou...@epfl.ch> wrote:
>
> Hello,
>
> I have a Ceph Nautilus (14.2.1) cluster, used for CephFS only, with 40x 1.8T SAS
> disks (no SSD) across 20 servers.
>
> >   cluster:
> >     id:     778234df-5784-4021-b983-0ee1814891be
> >     health: HEALTH_WARN
> >             2 MDSs report slow requests
> >
> >   services:
> >     mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 5d)
> >     mgr: icadmin008(active, since 18h), standbys: icadmin007, icadmin006
> >     mds: cephfs:3 
> > {0=icadmin006=up:active,1=icadmin007=up:active,2=icadmin008=up:active}
> >     osd: 40 osds: 40 up (since 2w), 40 in (since 3w)
> >
> >   data:
> >     pools:   3 pools, 672 pgs
> >     objects: 36.08M objects, 19 TiB
> >     usage:   51 TiB used, 15 TiB / 65 TiB avail
> >     pgs:     670 active+clean
> >              2   active+clean+scrubbing
>
> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow requests, 
> 0 included below; oldest blocked for > 60281.199503 secs"
>
> > HEALTH_WARN 2 MDSs report slow requests
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> >     mdsicadmin007(mds.1): 3 slow requests are blocked > 30 secs
> >     mdsicadmin006(mds.0): 10 slow requests are blocked > 30 secs
>
> After some investigation, I saw that ALL ceph-osd processes eat a lot of
> memory, up to 130GB RSS each. Is this value normal? Could it be related to the
> slow requests? Does running on HDDs only increase the probability of slow requests?
>
> > USER         PID %CPU %MEM       VSZ       RSS TTY STAT START  TIME COMMAND
> > ceph       34196  3.6 35.0 156247524 138521572 ? Ssl  Jul01 4173:18 
> > /usr/bin/ceph-osd -f --cluster apollo --id 1 --setuser ceph --setgroup ceph
> > ceph       34394  3.6 35.0 160001436 138487776 ? Ssl  Jul01 4178:37 
> > /usr/bin/ceph-osd -f --cluster apollo --id 32 --setuser ceph --setgroup ceph
> > ceph       34709  3.5 35.1 156369636 138752044 ? Ssl  Jul01 4088:57 
> > /usr/bin/ceph-osd -f --cluster apollo --id 29 --setuser ceph --setgroup ceph
> > ceph       34915  3.4 35.1 158976936 138715900 ? Ssl  Jul01 3950:45 
> > /usr/bin/ceph-osd -f --cluster apollo --id 3 --setuser ceph --setgroup ceph
> > ceph       34156  3.4 35.1 158280768 138714484 ? Ssl  Jul01 3984:11 
> > /usr/bin/ceph-osd -f --cluster apollo --id 30 --setuser ceph --setgroup ceph
> > ceph       34378  3.7 35.1 155162420 138708096 ? Ssl  Jul01 4312:12 
> > /usr/bin/ceph-osd -f --cluster apollo --id 8 --setuser ceph --setgroup ceph
> > ceph       34161  3.5 35.0 159606788 138523652 ? Ssl  Jul01 4128:17 
> > /usr/bin/ceph-osd -f --cluster apollo --id 16 --setuser ceph --setgroup ceph
> > ceph       34380  3.6 35.1 161465372 138670168 ? Ssl  Jul01 4238:20 
> > /usr/bin/ceph-osd -f --cluster apollo --id 35 --setuser ceph --setgroup ceph
> > ceph       33822  3.7 35.1 163456644 138734036 ? Ssl  Jul01 4342:05 
> > /usr/bin/ceph-osd -f --cluster apollo --id 15 --setuser ceph --setgroup ceph
> > ceph       34003  3.8 35.0 161868584 138531208 ? Ssl  Jul01 4427:32 
> > /usr/bin/ceph-osd -f --cluster apollo --id 38 --setuser ceph --setgroup ceph
> > ceph        9753  2.8 24.2 96923856 95580776 ?   Ssl  Sep02 700:25 
> > /usr/bin/ceph-osd -f --cluster apollo --id 31 --setuser ceph --setgroup ceph
> > ceph       10120  2.5 24.0 96130340 94856244 ?   Ssl  Sep02 644:50 
> > /usr/bin/ceph-osd -f --cluster apollo --id 7 --setuser ceph --setgroup ceph
> > ceph       36204  3.6 35.0 159394476 138592124 ? Ssl  Jul01 4185:36 
> > /usr/bin/ceph-osd -f --cluster apollo --id 18 --setuser ceph --setgroup ceph
> > ceph       36427  3.7 34.4 155699060 136076432 ? Ssl  Jul01 4298:26 
> > /usr/bin/ceph-osd -f --cluster apollo --id 36 --setuser ceph --setgroup ceph
> > ceph       36622  4.1 35.1 158219408 138724688 ? Ssl  Jul01 4779:14 
> > /usr/bin/ceph-osd -f --cluster apollo --id 19 --setuser ceph --setgroup ceph
> > ceph       36881  4.0 35.1 157748752 138719064 ? Ssl  Jul01 4669:54 
> > /usr/bin/ceph-osd -f --cluster apollo --id 37 --setuser ceph --setgroup ceph
> > ceph       34649  3.7 35.1 159601580 138652012 ? Ssl  Jul01 4337:20 
> > /usr/bin/ceph-osd -f --cluster apollo --id 14 --setuser ceph --setgroup ceph
> > ceph       34881  3.8 35.1 158632412 138764376 ? Ssl  Jul01 4433:50 
> > /usr/bin/ceph-osd -f --cluster apollo --id 33 --setuser ceph --setgroup ceph
> > ceph       34646  4.2 35.1 155029328 138732376 ? Ssl  Jul01 4831:24 
> > /usr/bin/ceph-osd -f --cluster apollo --id 17 --setuser ceph --setgroup ceph
> > ceph       34881  4.1 35.1 156801676 138763588 ? Ssl  Jul01 4710:19 
> > /usr/bin/ceph-osd -f --cluster apollo --id 39 --setuser ceph --setgroup ceph
> > ceph       36766  3.7 35.1 158070740 138703240 ? Ssl  Jul01 4341:42 
> > /usr/bin/ceph-osd -f --cluster apollo --id 13 --setuser ceph --setgroup ceph
> > ceph       37013  3.5 35.0 157767668 138272248 ? Ssl  Jul01 4094:12 
> > /usr/bin/ceph-osd -f --cluster apollo --id 34 --setuser ceph --setgroup ceph
> > ceph       35007  3.4 35.1 160318780 138756404 ? Ssl  Jul01 3963:21 
> > /usr/bin/ceph-osd -f --cluster apollo --id 2 --setuser ceph --setgroup ceph
> > ceph       35217  3.5 35.1 159023744 138626680 ? Ssl  Jul01 4041:50 
> > /usr/bin/ceph-osd -f --cluster apollo --id 22 --setuser ceph --setgroup ceph
> > ceph       36962  3.2 35.1 158692228 138730292 ? Ssl  Jul01 3772:35 
> > /usr/bin/ceph-osd -f --cluster apollo --id 5 --setuser ceph --setgroup ceph
> > ceph     2991351  2.6 22.9 92011392 90761128 ?   Ssl  Sep02 666:32 
> > /usr/bin/ceph-osd -f --cluster apollo --id 21 --setuser ceph --setgroup ceph
> > ceph       35503  3.2 35.0 158784940 138502100 ? Ssl  Jul01 3766:33 
> > /usr/bin/ceph-osd -f --cluster apollo --id 25 --setuser ceph --setgroup ceph
> > ceph       35683  3.6 35.1 160927812 138678080 ? Ssl  Jul01 4233:17 
> > /usr/bin/ceph-osd -f --cluster apollo --id 4 --setuser ceph --setgroup ceph
> > ceph       36969  3.7 35.1 158701188 138745028 ? Ssl  Jul01 4348:06 
> > /usr/bin/ceph-osd -f --cluster apollo --id 20 --setuser ceph --setgroup ceph
> > ceph     1902641  2.5 24.1 96688368 95438808 ?   Ssl  Sep02 633:45 
> > /usr/bin/ceph-osd -f --cluster apollo --id 0 --setuser ceph --setgroup ceph
> > ceph       35576  3.7 35.1 156262424 138750552 ? Ssl  Jul01 4338:09 
> > /usr/bin/ceph-osd -f --cluster apollo --id 27 --setuser ceph --setgroup ceph
> > ceph     1901746  2.5 24.8 99300108 98051192 ?   Ssl  Sep02 641:52 
> > /usr/bin/ceph-osd -f --cluster apollo --id 6 --setuser ceph --setgroup ceph
> > ceph       35735  3.7 35.1 156027400 138738076 ? Ssl  Jul01 4350:00 
> > /usr/bin/ceph-osd -f --cluster apollo --id 24 --setuser ceph --setgroup ceph
> > ceph       35929  3.7 35.0 160626040 138511872 ? Ssl  Jul01 4361:54 
> > /usr/bin/ceph-osd -f --cluster apollo --id 9 --setuser ceph --setgroup ceph
> > ceph       35699  3.1 35.1 158773084 138728576 ? Ssl  Jul01 3631:13 
> > /usr/bin/ceph-osd -f --cluster apollo --id 10 --setuser ceph --setgroup ceph
> > ceph     2941709  2.5 24.2 97125336 95906728 ?   Ssl  Sep02 638:11 
> > /usr/bin/ceph-osd -f --cluster apollo --id 28 --setuser ceph --setgroup ceph
> > ceph       38429  3.2 35.1 156638164 138712612 ? Ssl  Jul01 3687:45 
> > /usr/bin/ceph-osd -f --cluster apollo --id 12 --setuser ceph --setgroup ceph
> > ceph       38651  3.3 35.1 159650296 138735924 ? Ssl  Jul01 3835:51 
> > /usr/bin/ceph-osd -f --cluster apollo --id 26 --setuser ceph --setgroup ceph
> > ceph       35890  2.9 35.1 156923512 138734428 ? Ssl  Jul01 3361:21 
> > /usr/bin/ceph-osd -f --cluster apollo --id 11 --setuser ceph --setgroup ceph
> > ceph       36129  3.3 35.1 158782748 138739248 ? Ssl  Jul01 3845:41 
> > /usr/bin/ceph-osd -f --cluster apollo --id 23 --setuser ceph --setgroup ceph
>
> some logs :
>
> > 2019-09-19 08:52:33.960242 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62427.674399 secs
> > 2019-09-19 08:52:37.527465 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62431.241789 secs
> > 2019-09-19 08:52:42.527581 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62436.241899 secs
> > 2019-09-19 08:52:38.960358 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62432.674515 secs
> > 2019-09-19 08:52:43.960476 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62437.674620 secs
> > 2019-09-19 08:52:47.527663 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62441.241987 secs
> > 2019-09-19 08:52:52.527770 mds.icadmin007 [WRN] 3 slow requests, 2 included 
> > below; oldest blocked for > 62446.242061 secs
> > 2019-09-19 08:52:52.527777 mds.icadmin007 [WRN] slow request 61444.792236 
> > seconds old, received at 2019-09-18 17:48:47.735459: internal op 
> > exportdir:mds.1:13 currently failed to wrlock, waiting
> > 2019-09-19 08:52:52.527783 mds.icadmin007 [WRN] slow request 61444.792163 
> > seconds old, received at 2019-09-18 17:48:47.735533: internal op 
> > exportdir:mds.1:14 currently failed to wrlock, waiting
> > 2019-09-19 08:52:48.960590 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62442.674748 secs
> > 2019-09-19 08:52:53.960684 mds.icadmin006 [WRN] 10 slow requests, 2 
> > included below; oldest blocked for > 62447.674825 secs
> > 2019-09-19 08:52:53.960692 mds.icadmin006 [WRN] slow request 61441.895507 
> > seconds old, received at 2019-09-18 17:48:52.065114: rejoin:mds.1:13 
> > currently dispatched
> > 2019-09-19 08:52:53.960697 mds.icadmin006 [WRN] slow request 61441.895489 
> > seconds old, received at 2019-09-18 17:48:52.065131: rejoin:mds.1:14 
> > currently dispatched
> > 2019-09-19 08:52:57.527852 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62451.242174 secs
> > 2019-09-19 08:53:02.527972 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62456.242289 secs
> > 2019-09-19 08:52:58.960777 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62452.674936 secs
> > 2019-09-19 08:53:03.960853 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62457.675011 secs
> > 2019-09-19 08:53:07.528033 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62461.242354 secs
> > 2019-09-19 08:53:12.528177 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62466.242487 secs
> > 2019-09-19 08:53:08.960965 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62462.675123 secs
> > 2019-09-19 08:53:13.961034 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62467.675195 secs
> > 2019-09-19 08:53:17.528276 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62471.242592 secs
> > 2019-09-19 08:53:22.528407 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62476.242729 secs
> > 2019-09-19 08:53:18.961149 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62472.675310 secs
> > 2019-09-19 08:53:23.961234 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62477.675392 secs
> > 2019-09-19 08:53:27.528509 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62481.242832 secs
> > 2019-09-19 08:53:32.528651 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62486.242961 secs
> > 2019-09-19 08:53:28.961314 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62482.675471 secs
> > 2019-09-19 08:53:33.961393 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62487.675549 secs
> > 2019-09-19 08:53:37.528706 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62491.243031 secs
> > 2019-09-19 08:53:42.528790 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62496.243105 secs
> > 2019-09-19 08:53:38.961476 mds.icadmin006 [WRN] 10 slow requests, 1 
> > included below; oldest blocked for > 62492.675617 secs
> > 2019-09-19 08:53:38.961485 mds.icadmin006 [WRN] slow request 61441.151061 
> > seconds old, received at 2019-09-18 17:49:37.810351: 
> > client_request(client.21441:176429 getattr pAsLsXsFs #0x10000f2b1b3 
> > 2019-09-18 17:49:37.806002 caller_uid=204878, caller_gid=11233{}) currently 
> > failed to rdlock, waiting
> > 2019-09-19 08:53:43.961569 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62497.675728 secs
> > 2019-09-19 08:53:47.528891 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62501.243214 secs
> > 2019-09-19 08:53:52.529021 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62506.243337 secs
> > 2019-09-19 08:53:48.961685 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62502.675839 secs
> > 2019-09-19 08:53:53.961792 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62507.675948 secs
> > 2019-09-19 08:53:57.529113 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62511.243437 secs
> > 2019-09-19 08:54:02.529224 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62516.243546 secs
> > 2019-09-19 08:53:58.961866 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62512.676025 secs
> > 2019-09-19 08:54:03.961939 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62517.676099 secs
>
> Thanks for your help.

If you haven't set:

osd op queue cut off = high

in the cluster config on your OSD hosts (likely /etc/ceph/apollo.conf here, given
the --cluster apollo flag in your ps output), I'd give that a try. It should
help quite a bit with pure-HDD clusters.
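
A minimal sketch of how to apply and verify it, assuming the config file name
matches your --cluster apollo setup and using osd.0 as a placeholder id;
adjust paths and ids for your hosts:

    # In the [osd] section of the cluster config on every OSD host
    # (e.g. /etc/ceph/apollo.conf for a cluster named "apollo"):
    [osd]
        osd op queue cut off = high

    # Or set it in the Nautilus centralized config database instead:
    ceph --cluster apollo config set osd osd_op_queue_cut_off high

    # As far as I know the option is only read at OSD startup, so restart
    # the OSDs one host at a time, then verify on each daemon through its
    # admin socket:
    ceph --cluster apollo daemon osd.0 config get osd_op_queue_cut_off

As I understand it, the higher cut off keeps high-priority internal ops (e.g.
replication) in the strict priority queue instead of the weighted queue, so
they don't get stuck behind a backlog on slow spinners; check the
osd_op_queue_cut_off documentation for the exact semantics.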
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
