Hi Robert, all,

On 23/09/2019 17:37, Robert LeBlanc wrote:
On Mon, Sep 23, 2019 at 4:14 AM Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:
Hi all,

When syncing data with rsync, I'm often getting blocked slow requests,
which also block access to this path.

2019-09-23 11:25:49.477 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 31.895478 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:26:19.477 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 61.896079 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:27:19.478 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 121.897268 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:29:19.488 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 241.899467 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:33:19.680 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 482.087927 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:36:09.881 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 32.677511 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:36:39.881 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 62.678132 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:37:39.891 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 122.679273 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:39:39.892 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 242.684667 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:41:19.893 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 962.305681 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:43:39.923 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 482.712888 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:51:40.236 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 963.037049 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 11:57:20.308 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 1922.719287 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 12:07:40.621 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 1923.409501 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 12:29:20.639 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 3843.057602 seconds old, received at 2019-09-23
11:25:17.598152: client_request(client.38352684:92684 lookup
#0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-09-23 12:39:40.872 7f4f401e8700  0 log_channel(cluster) log [WRN]
: slow request 3843.664914 seconds old, received at 2019-09-23
11:35:37.217113: client_request(client.38347357:111963 lookup
#0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
caller_gid=0{0,}) currently failed to authpin, subtree is being exported
If I try to ls these paths, the client hangs.

I tried this with the CentOS 7.6 ceph kernel client and now also with
ceph-fuse 14.2.3, and I see the issue with both. Remounting did not solve
the issue; if I restart the mds, the issue goes away - for some time.
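
While a request is stuck like this, the blocked ops can also be inspected on
the MDS admin socket (a sketch, assuming the default admin socket path and
that mds02 holds the affected rank):

[root@mds02 ~]# ceph daemon mds.mds02 dump_ops_in_flight   # in-flight requests; the stuck lookups should appear with the authpin flag_point
[root@mds02 ~]# ceph daemon mds.mds02 dump_historic_ops    # recently completed slow requests, for comparison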


[root@mds02 ~]# ceph -s
   cluster:
     id:     92bfcf0a-1d39-43b3-b60f-44f01b630e47
     health: HEALTH_WARN
             1 MDSs report slow requests
             2 MDSs behind on trimming

   services:
     mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
     mgr: mds02(active, since 4d), standbys: mds01, mds03
     mds: ceph_fs:2 {0=mds03=up:active,1=mds02=up:active} 1 up:standby
     osd: 535 osds: 535 up, 535 in

   data:
     pools:   3 pools, 3328 pgs
     objects: 375.14M objects, 672 TiB
     usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
     pgs:     3319 active+clean
              9    active+clean+scrubbing+deep

   io:
     client:   141 KiB/s rd, 54 MiB/s wr, 62 op/s rd, 577 op/s wr

[root@mds02 ~]# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming
MDS_SLOW_REQUEST 1 MDSs report slow requests
     mdsmds02(mds.1): 2 slow requests are blocked > 30 secs
MDS_TRIM 2 MDSs behind on trimming
     mdsmds02(mds.1): Behind on trimming (3407/200) max_segments: 200,
num_segments: 3407
     mdsmds03(mds.0): Behind on trimming (4240/200) max_segments: 200,
num_segments: 4240
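
Whether the journal is draining at all can be watched via the MDS perf
counters (a sketch, assuming the default admin socket path; counter names may
differ slightly between releases):

[root@mds02 ~]# ceph daemon mds.mds02 perf dump mds_log    # the 'seg' value should shrink as trimming catches up
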
Can someone help me to debug this further?
What is the makeup of your cluster? It sounds like it may be all HDD.

The metadata pool is on SSDs; the data pool is on HDDs with erasure coding.
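
For reference, the pool layout can be double-checked with something like:

[root@mds02 ~]# ceph osd pool ls detail     # shows which pools are replicated vs. erasure-coded
[root@mds02 ~]# ceph osd crush rule dump    # shows which device class (ssd/hdd) each CRUSH rule targets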



If so, try adding this to /etc/ceph/ceph.conf on your OSDs and restart
the processes.

osd op queue cut off = high

Depending on your version (default in newer versions), adding

osd op queue = wpq

can also help.

I had a cluster outage to debug, but the cluster is healthy again now, so I tried this.
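
Concretely, this is roughly what went into /etc/ceph/ceph.conf on the OSD
hosts (a sketch; the exact section placement may differ), followed by an OSD
restart on each host:

[osd]
osd op queue = wpq
osd op queue cut off = high

systemctl restart ceph-osd.target    # restarts all OSD daemons on the host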

I've set these parameters for all OSDs and restarted them. Sadly, I still have the same issue:

2019-10-15 17:49:45.752 7f3174869700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+1 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:46.192 7f3174869700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} v=232162868 cv=232162574/232162574 ap=0+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412702 rc2035-10-29 07:32:22.000000 b6728047022346 54954765=50183087+4771678)/n(v412702 rc2035-10-29 07:32:22.000000 b6728047025656 54954765=50183087+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=0 0x559ba0c9cf00]
2019-10-15 17:49:48.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567880 from mon.0
2019-10-15 17:49:51.332 7f317a875700  1 mds.mds01 Updating MDS map to version 567881 from mon.0
2019-10-15 17:49:59.812 7f317a875700  0 mds.0.migrator nicely exporting to mds.1 [dir 0x1000bd8ac7b /backups/kyukonhome/ [2,head] auth{1=2} pv=232162881 v=232162877 cv=232162574/232162574 ap=2+2 state=1610874881|complete f(v5134 m2019-10-14 14:00:01.693665 11=2+9) n(v412703 rc2035-10-29 07:32:22.000000 b6728048257510 54954854=50183176+4771678) hs=11+0,ss=3+0 dirty=2 | child=1 replicated=1 dirty=1 waiter=0 authpin=1 0x559ba0c9cf00]
2019-10-15 17:50:03.762 7f317a875700  1 mds.mds01 Updating MDS map to version 567882 from mon.0
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 33.413517 secs
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 33.413517 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:19.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 32.804341 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567883 from mon.0
2019-10-15 17:50:24.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 38.413619 secs
2019-10-15 17:50:29.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 43.413714 secs
2019-10-15 17:50:34.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 48.413976 secs
2019-10-15 17:50:39.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 53.414073 secs
2019-10-15 17:50:44.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 58.414164 secs
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 63.414259 secs
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 63.414258 seconds old, received at 2019-10-15 17:49:46.402110: client_request(client.41588185:115259 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_844.mat 2019-10-15 17:49:46.402038 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:49.802 7f3178070700  0 log_channel(cluster) log [WRN] : slow request 62.805083 seconds old, received at 2019-10-15 17:49:47.011286: client_request(client.41588185:115282 lookup #0x20004c5bdb2/ind_psi_alpha_0_ind_psi_theta_0_seed_1308.mat 2019-10-15 17:49:47.011193 caller_uid=0, caller_gid=0{0,}) currently failed to authpin, subtree is being exported
2019-10-15 17:50:54.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 68.414353 secs
2019-10-15 17:50:59.802 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 73.414450 secs
....
2019-10-15 17:53:02.422 7f317a875700  1 mds.mds01 Updating MDS map to version 567906 from mon.0
2019-10-15 17:53:04.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 198.417149 secs
2019-10-15 17:53:05.732 7f317a875700  1 mds.mds01 Updating MDS map to version 567907 from mon.0
2019-10-15 17:53:09.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 203.417241 secs
2019-10-15 17:53:14.412 7f317a875700  1 mds.mds01 Updating MDS map to version 567908 from mon.0
2019-10-15 17:53:14.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 208.417346 secs
2019-10-15 17:53:17.662 7f317a875700  1 mds.mds01 Updating MDS map to version 567909 from mon.0
2019-10-15 17:53:19.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 213.417443 secs
2019-10-15 17:53:22.002 7f317a875700  1 mds.mds01 Updating MDS map to version 567910 from mon.0
2019-10-15 17:53:24.812 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 218.417538 secs
2019-10-15 17:53:26.362 7f317a875700  1 mds.mds01 Updating MDS map to version 567911 from mon.0
...
2019-10-15 17:58:14.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 508.440553 secs
2019-10-15 17:58:18.062 7f317a875700  1 mds.mds01 Updating MDS map to version 567970 from mon.0
2019-10-15 17:58:19.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 513.440638 secs
2019-10-15 17:58:21.672 7f317a875700  1 mds.mds01 Updating MDS map to version 567971 from mon.0
2019-10-15 17:58:24.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 518.440702 secs
2019-10-15 17:58:26.162 7f317a875700  1 mds.mds01 Updating MDS map to version 567972 from mon.0
2019-10-15 17:58:29.832 7f3178070700  0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 523.440775 secs
So this blocks the cluster again.
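
While this is going on, the subtree migration that the requests are waiting
on can be inspected on the MDS admin socket (a sketch, assuming the default
socket path; mds01 appears to be the rank doing the export in the log above):

[root@mds01 ~]# ceph daemon mds.mds01 get subtrees    # lists subtree bounds and which rank is auth for each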

I'm running 14.2.3, except for the MDSes, which are running 14.2.2 due to a bug in 14.2.3.
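
The version mix across daemons can be confirmed with:

[root@mds02 ~]# ceph versions    # reports the running version per daemon type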

Thanks again!

Kenneth


----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1