=== Facts ===
1. RadosGW disabled, rados bench write - roughly 10x more traffic served
without a single slow_request (bench invocation sketched below the list).
2. RadosGW enabled - the first slow_requests appeared.
3. Traffic via RadosGW - 20-50 slow_requests per minute (~0.1% of IOPS).
4. We compacted leveldb on the MONs 48h before the first slow_requests. Maybe
there is some strange correlation?
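
For reference, a write load like the one in fact 1 can be generated with
rados bench; this is only a sketch, the pool name and runtime are
placeholders, not our exact invocation:

# 60-second write benchmark against a scratch pool
# (defaults: 16 concurrent 4 MB writes)
rados bench -p bench-test 60 write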

=== Details ===
ceph.log
2015-04-22 17:46:33.805926 osd.165 10.176.131.17:6810/98485 3 : [WRN] 1 slow 
requests, 1 included below; oldest blocked for > 36.621691 secs
2015-04-22 17:46:33.805933 osd.165 10.176.131.17:6810/98485 4 : [WRN] slow 
request 36.621691 seconds old, received at 2015-04-22 17:45:57.184143: 
osd_op(client.22009265.0:244576 
default.2139585.95_cache/6f63f82baac53e854c126598c6dc8aaf_800x600_90.jpg 
[create 0~0,setxattr user.rgw.idtag (23),writefull 0~9971,setxattr 
user.rgw.manifest (296),setxattr user.rgw.acl (129),setxattr 
user.rgw.content_type (25),setxattr user.rgw.etag (33),setxattr 
user.rgw.x-amz-meta-md5 (33),setxattr user.rgw.x-amz-meta-sync-priority (5)] 
10.58f70cfe e106833) v4 currently waiting for subops from [260,187]
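
When an op hangs like this "waiting for subops", the admin socket of the
primary OSD can show the op state directly; a sketch, assuming the default
socket path (newer releases also offer dump_historic_ops):

# inspect in-flight ops on the OSD reporting the slow request
ceph --admin-daemon /var/run/ceph/ceph-osd.165.asok dump_ops_in_flight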

heartbeat_map warnings shortly before the slow_request:
2015-04-22 17:46:13.059658 7fcb9f64c700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059669 7fcb9f64c700 20 heartbeat_map is_healthy = NOT 
HEALTHY
2015-04-22 17:46:13.059677 7fcb9f64c700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059683 7fcb9f64c700 20 heartbeat_map is_healthy = NOT 
HEALTHY
2015-04-22 17:46:13.059693 7fcba064e700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:13.059701 7fcba064e700 20 heartbeat_map is_healthy = NOT 
HEALTHY
….
2015-04-22 17:46:33.794865 7fcb9f64c700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:33.794877 7fcb9f64c700 20 heartbeat_map is_healthy = NOT 
HEALTHY
2015-04-22 17:46:33.795912 7fcba064e700  1 heartbeat_map is_healthy 'OSD::op_tp 
thread 0x7fcb94636700' had timed out after 15
2015-04-22 17:46:33.795923 7fcba064e700 20 heartbeat_map is_healthy = NOT 
HEALTHY
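
The "timed out after 15" above is the OSD op thread timeout (15 s by
default). Just to rule out a non-default setting, the running value can be
read back via the admin socket; a sketch, assuming the default socket path:

# confirm the op thread timeout the daemon is actually running with
ceph --admin-daemon /var/run/ceph/ceph-osd.165.asok config show | grep op_thread_timeout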

Lossy channel faults a few seconds before the slow_requests, even on local
(same-host) connections:
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:43.471808 7ffcd7837700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.131.4:0/2937379 pipe(0x103ce780 sd=180 
:6829 s=2 pgs=121 cs=1 l=1 c=0x103cbdc0).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:43.471828 7ffcd7736700 10 -- 
10.176.131.4:6821/2938116 >> 10.176.131.4:0/2937379 pipe(0x103b5280 sd=742 
:6821 s=2 pgs=122 cs=1 l=1 c=0x103beb00).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229544 7ffcfd28f700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.139.11:0/3658878 pipe(0x1073f280 sd=735 
:6829 s=2 pgs=1919 cs=1 l=1 c=0xff6dc60).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229670 7ffcf813e700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.139.29:0/2882509 pipe(0x17f74000 sd=862 
:6829 s=2 pgs=582 cs=1 l=1 c=0x101e96e0).fault on lossy channel, failing
/var/log/ceph/ceph-osd.30.log:2015-04-21 14:57:44.229676 7ffcf4f0c700 10 -- 
10.176.139.4:6829/2938116 >> 10.176.139.35:0/2697022 pipe(0xff8ba00 sd=190 
:6829 s=2 pgs=13 cs=1 l=1 c=0xff6d420).fault on lossy channel, failing

Network
We checked it twice with iperf, ping, etc., and it looks OK; sysctls are also
tuned. We do observe peaks of TCP RSTs and TCP drops when the slow_requests
occur, but these appear to be an effect of the problem rather than its cause.
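
The RST/drop peaks come from the kernel counters; a rough way to correlate
them with a slow_request window is to diff two snapshots taken around it
(plain netstat, nothing Ceph-specific):

# TCP error counters; compare a snapshot before and after a slow_request burst
netstat -s | egrep -i 'reset|retrans|drop'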

Drives/Machines
We checked our drives (SMART), RAID controllers and machines, and everything
looks fine. We disabled the caches on the RAID controllers and on the drives,
and nothing changed. There is no leader in the histogram of slow requests,
i.e. no single drive or machine with noticeably more problems than the others.
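
A histogram like that can be built from ceph.log with something along these
lines (a sketch; the pattern matches the [WRN] lines quoted above):

# count slow request warnings per reporting OSD
grep 'slow request' /var/log/ceph/ceph.log | grep -oE 'osd\.[0-9]+' | sort | uniq -c | sort -rn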

Regards,

Slawomir Ritter


From: ceph-users [mailto:[email protected]] On Behalf Of Craig 
Lewis
Sent: Friday, April 17, 2015 7:56 PM
To: Dominik Mostowiec
Cc: Studziński Krzysztof; [email protected]
Subject: Re: [ceph-users] many slow requests on different osds (scrubbing 
disabled)

I've seen something like this a few times.

Once, I lost the battery in my battery backed RAID card.  That caused all the 
OSDs on that host to be slow, which triggered slow request notices pretty much 
cluster wide.  It was only when I histogrammed the slow request notices that I 
saw most of them were on a single node.  I compared the disk latency graphs 
between nodes, and saw that one node had a much higher write latency. This took 
me a while to track down.
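
If you don't keep latency graphs, a quick comparison can be done with iostat
on each node; a sketch (watch the await column, or w_await on newer sysstat):

# extended per-device stats every 5 seconds
iostat -x 5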

Another time, I had a consumer HDD that was slowly failing.  It would hit a
group of bad sectors, remap, repeat.  SMART warned me about it, so I replaced
the disk after the second round of slow request alerts.  This was pretty
straightforward to diagnose, only because smartd notified me.
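
For anyone checking for the same pattern, the relevant counters can be pulled
straight from SMART; a sketch (the device name is a placeholder):

# remapped / pending sector counters on a suspect drive
smartctl -a /dev/sdX | egrep -i 'reallocated|pending|uncorrect'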


In both cases, I saw "slow request" notices on the affected disks.  Your
osd.284 says osd.186 and osd.177 are being slow, but osd.186 and osd.177
don't claim to be slow.

It's possible that there is another disk that is slow, causing osd.186 and
osd.177 replication to slow down.  With the PG distribution over OSDs, one
disk being a little slow can affect a large number of OSDs.


If SMART doesn't show you a disk is failing, I'd start looking for disks (the
disk itself, not the OSD daemon) with a high latency around your problem
times.  If you focus on the problem times, give it a +/- 10 minute window.
Sometimes it takes a little while for the disk slowness to spread out enough
for Ceph to complain.
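
If sysstat's sar is collecting data, per-disk latency for exactly such a
window can be pulled after the fact; a sketch (the times are placeholders,
and without -f it only reads today's archive):

# per-device activity between 17:35 and 17:55 from today's sysstat archive
sar -d -s 17:35:00 -e 17:55:00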


On Wed, Apr 15, 2015 at 3:20 PM, Dominik Mostowiec 
<[email protected]<mailto:[email protected]>> wrote:
Hi,
For a few days we have been noticing many slow requests on our cluster.
Cluster:
ceph version 0.67.11
3 x mon
36 hosts -> 10 OSDs (4 TB) + 2 SSDs (journals) each
Scrubbing and deep scrubbing are disabled, but the count of slow requests is
still increasing.
Disk utilisation has been very low since we disabled scrubbing.
Logs from one slow write, captured with debug osd = 20/20:
osd.284 - master: http://pastebin.com/xPtpNU6n
osd.186 - replica: http://pastebin.com/NS1gmhB0
osd.177 - replica: http://pastebin.com/Ln9L2Z5Z
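
For reference, that debug level can be raised at runtime without restarting
the OSD; a sketch, reusing osd.284 from the paste above:

# bump OSD debug logging on the fly; revert to the default 0/5 afterwards
ceph tell osd.284 injectargs '--debug-osd 20/20'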

Can you help me find the reason for it?

--
Regards
Dominik
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

