I often read that small IO writes and RBD perform better with a bigger
filestore_max_sync_interval than the default value.
The default is 5 seconds, and I have seen many posts saying people are using 30 seconds.
The slow request symptom is also often linked to this parameter.
My journals are 10GB (co-located with the OSD storage) and the overall client IO
write throughput is around 500MB/s at peak time.
I understand that the journal is flushed when it is half full or when
filestore_max_sync_interval is reached.
I guess I can safely give it a try, just to see the impact?
I am also not sure whether I can change this online?
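If it can be changed at runtime, I was thinking of something like the following
(assuming injectargs accepts this setting on the fly, which I have not verified):

        ceph tell osd.* injectargs '--filestore_max_sync_interval 30'

and checking the current value on an OSD admin socket with:

        ceph daemon osd.<id> config get filestore_max_sync_interval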

Thanks

Thomas

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: Friday 18 November 2016 12:42
To: n...@fisk.me.uk; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Here are some logs. The system is in the IST timezone and I have filtered the logs to
keep only the last 2 hours, during which the issue can be observed.

In that particular case, the issue is illustrated with the following OSDs:

Primary:
ID:607
PID:2962227
HOST:10.137.81.18

Secondary1
ID:528
PID:3721728
HOST:10.137.78.194

Secondary2
ID:771
PID:2806795
HOST:10.137.81.25

In that specific example, the first slow request message is detected at 16:18:

2016-11-18 16:18:51.991185 7f13acd8a700  0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 30.521107 secs
2016-11-18 16:18:51.991213 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 30.521107 seconds old, received at 2016-11-18 16:18:21.469965: 
osd_op(client.2406870.1:140440919 rbd_data.616bf2ae8944a.00000000002b85a7 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1449984~524288] 
0.4e69d0de snapc 218=[218,1fb,1df] ondisk+write e212564) currently waiting for 
subops from 528,771

I see that it is about replicating a 4MB object with a snapc context, but in my
environment I have no snapshots (they were actually all deleted). I was also told
that those messages are not necessarily related to object replication to a
snapshot image.
Every slow request message I get is formatted as above, with a 4MB object and a
snapc context.

rados df shows that I have 4 cloned objects; I do not understand why.

About 15 minutes later the ops seem to be unblocked, after an "initiating reconnect" message:

2016-11-18 16:34:38.120918 7f13acd8a700  0 log_channel(cluster) log [WRN] : 
slow request 960.264008 seconds old, received at 2016-11-18 16:18:37.856850: 
osd_op(client.2406634.1:104826541 rbd_data.636fe2ae8944a.0000000000111eec 
[set-alloc-hint object_size 4194304 write_size 4194304,write 4112384~4096] 
0.f56e90de snapc 426=[426,3f9] ondisk+write e212564) currently waiting for 
subops from 528,771
2016-11-18 16:34:46.863383 7f137705a700  0 -- 10.137.81.18:6840/2962227 >> 
10.137.81.135:0/748393319 pipe(0x293bd000 sd=35 :6840 s=0 pgs=0 cs=0 l=0 
c=0x21405020).accept peer addr is really 10.137.81.135:0/748393319 (socket is 
10.137.81.135:26749/0)
2016-11-18 16:35:05.048420 7f138fea6700  0 -- 192.168.228.36:6841/2962227 >> 
192.168.228.28:6805/3721728 pipe(0x1271b000 sd=34 :50711 s=2 pgs=647 cs=5 l=0 
c=0x42798c0).fault, initiating reconnect

I do not manage to identify anything obvious in the logs.

Thanks for your help …

Thomas


From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Thursday 17 November 2016 11:02
To: Thomas Danan; n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Thomas,

Do you have the OSD logs from around the time of that slow request (13:12 to 
13:29 period)?

Do you also see anything about OSDs going down in the mon ceph.log file around
that time?

480 seconds is probably far too long for a disk to be busy for. I'm wondering
whether the OSD is dying and respawning, or whether you are running out of some
type of system resource, e.g. TCP connections or something like that, which would
mean the OSDs can't communicate with each other.
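If you want to rule out the resource angle, checking something like ss -s /
netstat -s and the open file descriptor counts (ulimit -n for the OSD processes,
or /proc/<osd pid>/fd) on the OSD hosts while the blocking is happening might show
whether a connection or fd limit is being hit. Just a thought, I don't know your
setup well enough to say.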

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 17 November 2016 08:59
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

I have rechecked the pattern when slow requests are detected.

I have an example with the following OSDs (primary: 411, secondaries: 176, 594).
On the primary, slow requests were detected, waiting for subops from (176, 594)
for 16 minutes!

2016-11-17 13:29:27.209754 7f001d414700 0 log_channel(cluster) log [WRN] : 7 
slow requests, 7 included below; oldest blocked for > 480.477315 secs
2016-11-17 13:29:27.209777 7f001d414700 0 log_channel(cluster) log [WRN] : slow 
request 480.477315 seconds old, received at 2016-11-17 13:21:26.732303: 
osd_op(client.2407558.1:206455044 rbd_data.66ea12ae8944a.00000000001acbbc 
[set-alloc-hint object_size 4194304 write_size 4194304,write 1257472~368640] 
0.61fe279f snapc 3fd=[3fd,3de] ondisk+write e210553) currently waiting for 
subops from 176,594

So the primary OSD had been waiting for subops since 13:21 (13:29 minus 480 seconds).

2016-11-17 13:36:33.039691 7efffd8ee700 0 -- 192.168.228.23:6800/694486 >> 
192.168.228.7:6819/3611836 pipe(0x13ffd000 sd=33 :17791 s=2 pgs=131 cs=7 l=0 
c=0x13251de0).fault, initiating reconnect
2016-11-17 13:36:39.570692 7efff6784700 0 -- 192.168.228.23:6800/694486 >> 
192.168.228.13:6858/2033854 pipe(0x17009000 sd=60 :52188 s=2 pgs=147 cs=7 l=0 
c=0x141159c0).fault, initiating reconnect

After this log entry, it seems the ops are unblocked and I no longer see
"currently waiting for subops from 176,594".

So the primary OSD was blocked for roughly 15 minutes in total!


On the secondary OSD, I can see the following messages during the same period
(but also before and after):

secondary:
2016-11-17 13:34:58.125076 7fbcc7517700 0 -- 192.168.228.7:6819/3611836 >> 
192.168.228.42:6832/2873873 pipe(0x12d2e000 sd=127 :6819 s=2 pgs=86 cs=5 l=0 
c=0xef18c00).fault with nothing to send, going to sandby

In some other examples, with some DEBUG messages activated, I was also able to
see many of the following messages on the secondary OSDs:
2016-11-15 03:53:04.298502 7ff9c434f700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7ff9bdb42700' had timed out after 15
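If I read it correctly, the 15 in that message is the default
osd_op_thread_timeout, so the op thread on the secondary was itself stuck for
more than 15 seconds (please correct me if I am wrong).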

Thomas

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Wednesday 16 November 2016 22:13
To: Thomas Danan; n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi,

Yes, I can't think of anything else at this stage. Could you maybe repost some
dump_historic_ops output now that you have turned off snapshots? I wonder if it
might reveal anything.
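For reference, I mean the output of the admin socket command on one of the
affected OSDs (adjust the OSD id):

        ceph daemon osd.<id> dump_historic_ops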

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 16 November 2016 17:38
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

We have deleted all snapshots and observed the system for several hours.
From what I see, this did not help to reduce the blocked ops and the IO freezes
on the Ceph client side.

We have also tried to increase the number of PGs a little (by 8, then by 128),
because this is something we need to do anyway and we wanted to see how the
cluster behaved.
During recovery, the number of blocked ops and their duration increased
significantly, and the number of impacted OSDs was much higher.

Don’t really know what to conclude from all of this …

Again, we have checked disk / network and everything seems fine …

Thomas
From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Wednesday 16 November 2016 14:01
To: Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Snapshots work by using copy-on-write. If you dirty even a 4kB section of a 4MB
object in the primary RBD, that entire 4MB object first needs to be read and then
written into the snapshot RBD.
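As a rough worked example: a single 4kB random write landing on a not-yet-copied
4MB object means a 4MB read plus a 4MB write for the CoW copy, i.e. roughly 1000x
amplification for the first write to each object after a snapshot is taken.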

From: Thomas Danan [mailto:thomas.da...@mycom-osi.com]
Sent: 16 November 2016 12:58
To: Thomas Danan 
<thomas.da...@mycom-osi.com<mailto:thomas.da...@mycom-osi.com>>; 
n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney' 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Actually I was wondering: is there any difference between a snapshot and a simple
RBD image?
With a simple RBD image, when doing random IO we are asking the Ceph cluster to
update one or several 4MB objects, no?
So snapshotting only multiplies the load by 2, not more. Am I wrong?

Thomas

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: Wednesday 16 November 2016 13:52
To: n...@fisk.me.uk<mailto:n...@fisk.me.uk>; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Hi Nick,

Yes, our application is doing small random IO, and I did not realize that the
snapshotting feature could degrade performance so much in that case.

We have just deactivated it and deleted all snapshots. I will let you know if
this drastically reduces the blocked ops and, consequently, the IO freezes on the
client side.

Thanks

Thomas

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Wednesday 16 November 2016 13:25
To: Thomas Danan; 'Peter Maloney'
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: [ceph-users] ceph cluster having blocke requests very frequently



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Thomas 
Danan
Sent: 15 November 2016 21:14
To: Peter Maloney 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently

Very interesting ...

Any idea why optimal tunables would help here? On our cluster we have 500TB of
data, and I am a bit concerned about changing them without taking a lot of
precautions...
I am curious to know how long it took you to change the tunables, the size of
your cluster, and the observed impact on client IO ...

Yes, we do have daily RBD snapshots from 16 different Ceph RBD clients.
Snapshotting the RBD image is quite immediate, while we are seeing the issue
continuously throughout the day...

Just to point out that when you take a snapshot, any write to the original RBD
means that the full 4MB object is copied into the snapshot. If you have a lot of
small random IO going to the original RBD, this can lead to massive write
amplification across the cluster and may cause issues such as what you describe.

Also be aware that deleting large snapshots puts significant strain on the OSDs
as they try to delete hundreds of thousands of objects.


I will check all of this tomorrow ...

Thanks again

Thomas



Sent from my Samsung device


-------- Original message --------
From: Peter Maloney 
<peter.malo...@brockmann-consult.de<mailto:peter.malo...@brockmann-consult.de>>
Date: 11/15/16 21:27 (GMT+01:00)
To: Thomas Danan <thomas.da...@mycom-osi.com<mailto:thomas.da...@mycom-osi.com>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] ceph cluster having blocke requests very frequently
On 11/15/16 14:05, Thomas Danan wrote:
> Hi Peter,
>
> Ceph cluster version is 0.94.5, we are running with Firefly tunables, and we
> have 10K PGs instead of the 30K/40K we should have.
> The linux kernel version is 3.10.0-327.36.1.el7.x86_64 with RHEL 7.2
>
> On our side we have the following settings:
> mon_osd_adjust_heartbeat_grace = false
> mon_osd_adjust_down_out_interval = false
> mon_osd_min_down_reporters = 5
> mon_osd_min_down_reports = 10
>
> which explains why the OSDs are not flapping, but they are still misbehaving
> and generating the slow requests I am describing.
>
> The osd_op_complaint_time is at the default value (30 sec); not sure I want
> to change it, based on your experience.
I wasn't saying you should set the complaint time to 5, just saying
that's why I have complaints logged with such low block times.
> Thomas

And now I'm testing this:
        osd recovery sleep = 0.5
        osd snap trim sleep = 0.5

(or fiddling with it as low as 0.1 to make it rebalance faster)
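(applied on the fly, something along the lines of
        ceph tell osd.* injectargs '--osd_recovery_sleep 0.5 --osd_snap_trim_sleep 0.5'
rather than restarting the OSDs)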

While also changing tunables to optimal (which will rebalance 75% of the
objects). This has had very good results so far (a few <14s blocks right at the
start, and none since, over an hour ago).
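(by "changing tunables to optimal" I mean running ceph osd crush tunables
optimal, in case you want to try the same)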

And I'm somehow hoping that will fix my rbd export-diff issue too... but at least
it appears to fix the blocks caused by the rebalance.

Do you use rbd snapshots? I think that may be causing my issues, based
on things like:

>             "description": "osd_op(client.692201.0:20455419 4.1b5a5bc1
> rbd_data.94a08238e1f29.000000000000617b [] snapc 918d=[918d]
> ack+ondisk+write+known_if_redirected e40036)",
>             "initiated_at": "2016-11-15 20:57:48.313432",
>             "age": 409.634862,
>             "duration": 3.377347,
>             ...
>                     {
>                         "time": "2016-11-15 20:57:48.313767",
>                         "event": "waiting for subops from 0,1,8,22"
>                     },
>             ...
>                     {
>                         "time": "2016-11-15 20:57:51.688530",
>                         "event": "sub_op_applied_rec from 22"
>                     },


That has "snapc" in it (CoW?), and I think it shows that just one OSD is delayed
a few seconds while the rest are really fast, like you said. (I'm not sure why I
see 4 OSDs here when I have size 3... node1 OSDs 0 and 1, and node3 OSDs 8 and 22.)

or some (which I think are shorter) have a description like:
> osd_repop(client.426591.0:203051290 4.1f9
> 4:9fe4c001:::rbd_data.4cf92238e1f29.00000000000014ef:head v 40047'2531604)
