Sebastian,
Thank you for taking the time to read my post and provide some advice and 
information about the possible causes.

I had come across blktrace but was not sure how to use it effectively.  Thanks 
for the tips.

I have performed some testing on the primary iscsi/drbd node, using blktrace to 
monitor I/O on the physical disc (/dev/sdb - 3 TB RAID 10 configured as 3 
partitions and 3 LVM volume groups) and on the drbd device during each of the 
following tests:

        1. use dd to write 100M directly to the LVM logical volume used as the 
backing device for the drbd device (/dev/drbd9)
        2. use dd to write 100M to the drbd device (/dev/drbd9) with the drbd 
resource disconnected
        3. use dd to write 100M to the drbd device (/dev/drbd9) with the drbd 
resource connected

I used the following blktrace command:
blktrace /dev/sdb /dev/drbd9 -b 4096

I used the following dd command (for test 1, of= pointed at the LVM logical 
volume rather than /dev/drbd9):
dd if=/dev/zero of=/dev/drbd9 oflag=direct bs=1M count=100
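
In outline, the sequence for the three tests was roughly as follows (the 
resource name "r9" and the logical volume path below are placeholders, not the 
actual names, and blktrace was running against /dev/sdb and /dev/drbd9 
throughout):

# Test 1: write directly to the backing LV (LV path is a placeholder)
drbdadm disconnect r9
dd if=/dev/zero of=/dev/vg_sdb1/lv_drbd9 oflag=direct bs=1M count=100

# Test 2: write to the drbd device with the resource still disconnected
dd if=/dev/zero of=/dev/drbd9 oflag=direct bs=1M count=100

# Test 3: reconnect the resource and write to the drbd device again
drbdadm connect r9
dd if=/dev/zero of=/dev/drbd9 oflag=direct bs=1M count=100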

Results Summary

The dd throughput results were as follows:

Direct to LVM logical volume (drbd resource disconnected - not that it actually 
matters here)-
        100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied, 0.133475 s, 786 MB/s

Drbd resource disconnected-
        100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied, 0.134998 s, 777 MB/s

Drbd resource connected-
        100+0 records in
        100+0 records out
        104857600 bytes (105 MB) copied, 0.710407 s, 148 MB/s

These results are reproducible.

I am not sure how to analyse and present the blktrace/blkparse results but as a 
first attempt (actually about the tenth attempt!) I have run:

blkparse sdb drbd9 | egrep '147,9|8,16|8,17' | egrep '\[dd\]|\[0\]|<-' | grep -v '8,19'

I think this shows the key information needed to identify where the delays are 
occurring.

147,9 is the major,minor of /dev/drbd9
8,16 is the major,minor of /dev/sdb
8,17 is the major,minor of /dev/sdb1, the partition on the physical disc 
containing the LVM volume group, logical volume and drbd resource used in the 
testing.
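
If it would help, I could also try summarising the per-IO latencies with btt 
from the blktrace package rather than eyeballing the blkparse output - 
something like the following (untested here; it assumes the trace files were 
left with their default names sdb.blktrace.* and drbd9.blktrace.*):

# Merge the per-CPU trace files into a single binary stream for btt
blkparse -i sdb -i drbd9 -d events.bin -O
# Report per-IO latency statistics (Q2Q, Q2C, D2C, etc.)
btt -i events.bin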

I have spent quite some time reviewing the blktrace data and believe I have a 
reasonable idea of what is going on now.  I can provide the data off-list if 
required.  I have shown below what I think are the key parts.

Drbd resource disconnected-

147,9   0      1  21.005266905  4350  Q  WS  0     + 256  [dd]
8,17    0   6236  21.005272149  4350  A  WS  512   + 256  <-
8,16    0   6237  21.005272439  4350  A  WS  2560  + 256  <-
8,16    0   6238  21.005272632  4350  Q  WS  2560  + 256  [dd]
8,16    0   6239  21.005273752  4350  G  WS  2560  + 256  [dd]
8,16    0   6240  21.005274750  4350  I  WS  2560  + 256  [dd]
8,16    0   6241  21.005275586  4350  D  WS  2560  + 256  [dd]
147,9   0      2  21.005287503  4350  Q  WS  256   + 256  [dd]
8,17    0   6242  21.005289389  4350  A  WS  768   + 256  <-
8,16    0   6243  21.005289500  4350  A  WS  2816  + 256  <-
8,16    0   6244  21.005289633  4350  Q  WS  2816  + 256  [dd]
8,16    0   6245  21.005290137  4350  G  WS  2816  + 256  [dd]
8,16    0   6246  21.005290647  4350  I  WS  2816  + 256  [dd]
8,16    0   6247  21.005291023  4350  D  WS  2816  + 256  [dd]
...
147,9   0      8  21.005361065  4350  Q  WS  1792  + 256  [dd]
8,17    0   6278  21.005362755  4350  A  WS  2304  + 256  <-
8,16    0   6279  21.005362861  4350  A  WS  4352  + 256  <-
8,16    0   6280  21.005362990  4350  Q  WS  4352  + 256  [dd]
8,16    0   6281  21.005363326  4350  G  WS  4352  + 256  [dd]
8,16    0   6282  21.005363760  4350  I  WS  4352  + 256  [dd]
8,16    0   6283  21.005364039  4350  D  WS  4352  + 256  [dd]
8,16    0   6284  21.005511623     0  C  WS  2560  + 256  [0]
8,16    0   6285  21.005628122     0  C  WS  2816  + 256  [0]
8,16    0   6286  21.005735922     0  C  WS  3072  + 256  [0]
8,16    0   6287  21.005885404     0  C  WS  3328  + 256  [0]
8,16    0   6288  21.005998357     0  C  WS  3584  + 256  [0]
8,16    0   6289  21.006143985     0  C  WS  3840  + 256  [0]
8,16    0   6290  21.006287354     0  C  WS  4096  + 256  [0]
8,16    0   6291  21.006414761     0  C  WS  4352  + 256  [0]
147,9   0      9  21.006599134  4350  Q  WS  2048  + 256  [dd]
...

Drbd resource connected-

147,9   4      1  137.254358367  29404  Q  WS  0     + 256  [dd]
8,17    4   5858  137.254366442  29404  A  WS  512   + 256  <-
8,16    4   5859  137.254366919  29404  A  WS  2560  + 256  <-
8,16    4   5860  137.254367231  29404  Q  WS  2560  + 256  [dd]
8,16    4   5861  137.254368957  29404  G  WS  2560  + 256  [dd]
8,16    4   5862  137.254369927  29404  I  WS  2560  + 256  [dd]
8,16    4   5863  137.254371124  29404  D  WS  2560  + 256  [dd]
147,9   4      2  137.254383956  29404  Q  WS  256   + 256  [dd]
8,17    4   5864  137.254386137  29404  A  WS  768   + 256  <-
8,16    4   5865  137.254386280  29404  A  WS  2816  + 256  <-
8,16    4   5866  137.254386413  29404  Q  WS  2816  + 256  [dd]
8,16    4   5867  137.254386945  29404  G  WS  2816  + 256  [dd]
8,16    4   5868  137.254387471  29404  I  WS  2816  + 256  [dd]
8,16    4   5869  137.254387809  29404  D  WS  2816  + 256  [dd]
...
147,9   4      8  137.254460623  29404  Q  WS  1792  + 256  [dd]
8,17    4   5900  137.254462917  29404  A  WS  2304  + 256  <-
8,16    4   5901  137.254463007  29404  A  WS  4352  + 256  <-
8,16    4   5902  137.254463129  29404  Q  WS  4352  + 256  [dd]
8,16    4   5903  137.254463480  29404  G  WS  4352  + 256  [dd]
8,16    4   5904  137.254463927  29404  I  WS  4352  + 256  [dd]
8,16    4   5905  137.254464193  29404  D  WS  4352  + 256  [dd]
8,16    0  23502  137.254625838      0  C  WS  2560  + 256  [0]
8,16    0  23503  137.254783835      0  C  WS  2816  + 256  [0]
8,16    0  23504  137.254895659      0  C  WS  3072  + 256  [0]
8,16    0  23505  137.255035266      0  C  WS  3328  + 256  [0]
8,16    0  23506  137.255174766      0  C  WS  3584  + 256  [0]
8,16    0  23507  137.255272516      0  C  WS  3840  + 256  [0]
8,16    0  23508  137.255405754      0  C  WS  4096  + 256  [0]
8,16    0  23509  137.255536862      0  C  WS  4352  + 256  [0]
147,9   4      9  137.263222933  29404  Q  WS  2048  + 256  [dd]
...

Analysis:
1. In both the connected and disconnected cases the write sizes used by DRBD are 
128 KiB (256 sectors of 512 bytes), so I think this is OK.  At least they are 
not 4 KiB, which would indicate the IO limits bug you mention.
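
For what it's worth, a quick way to double-check the request sizes queued on the 
drbd device is something like the following (field 10 of the default blkparse 
output is the request size in 512-byte sectors, so 256 = 128 KiB):

blkparse sdb drbd9 | awk '$1 == "147,9" && $6 == "Q" { print $10 }' | sort -n | uniq -c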

2. In the connected case there is a significant delay after the last IO 
completion action (C) before the next dd IO is queued: 0.007686071 seconds 
connected vs 0.000184373 seconds disconnected.
This behaviour is reproducible and is seen after every sequence of IO 
completion actions for the entire run of dd.
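
For reference, those figures are simply the gap between the last completion (C) 
on 8,16 and the next queue (Q) on 147,9 in the traces above:

        connected:     137.263222933 - 137.255536862 = 0.007686071 s
        disconnected:   21.006599134 -  21.006414761 = 0.000184373 s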

From my understanding of the blktrace data, the IO completion action occurs 
every 1 MB.

My hunch is that this delay is caused by network latency, i.e. transferring the 
data to the network layer and then across to the secondary DRBD node via the 
bonded interface.  Also, given Protocol C is being used, I understand IO 
blocking occurs until the data is actually written to disc (or controller BBU 
cache) on the secondary, so perhaps the secondary is responsible in this case?

If my analysis is correct then DRBD is not responsible for the delays.

Does this sound sensible and reasonable?  If so, how can I confirm this is the 
case?  Is there a tool like blktrace for use on network IO?
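
In case it helps, I was thinking of starting with something like the following 
to look at the replication link itself (the interface name, peer address and 
DRBD port below are placeholders for my actual configuration):

# Round-trip latency to the secondary over the bonded replication link
ping -c 100 192.168.10.2

# Capture the DRBD replication traffic for later timing analysis
tcpdump -i bond0 -w drbd-repl.pcap port 7788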

Given Ubuntu 12.04 LTS is using kernel 3.2.0-XX, is it worth disabling the 
intel_idle cpuidle driver, as described on this list by Shaun Thomas, to 
decrease latency?

http://www.engardelinux.org/modules/index/list_archives.cgi?list=drbd-user&page=0121.html&month=2012-10
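
If that is worth trying, my understanding is that it would amount to something 
like the following on Ubuntu (untested here, and the exact parameters may 
differ from what Shaun described): add the options to /etc/default/grub and 
regenerate the grub config.

# In /etc/default/grub (intel_idle.max_cstate=0 disables the intel_idle driver,
# processor.max_cstate=1 then limits the acpi_idle C-states)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=0 processor.max_cstate=1"

sudo update-grub
# reboot for the new kernel command line to take effect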

Comments and suggestions welcome.

Regards

Paul
> -----Original Message-----
> From: Sebastian Riemer [mailto:[email protected]]
> Sent: Tuesday, 8 January 2013 10:24 PM
> To: Paul Freeman
> Cc: [email protected]; Sebastian Riemer
> Subject: Re: [DRBD-user] High iowait on primary DRBD node with large
> sustained writes and replication enabled to secondary
> 
> On 07.01.2013 22:51, Paul Freeman wrote:
> > I have come across a few references of similar behaviour on the net but
> have not come across the solution(s) which appear relevant in my
> situation.
> >
> > Any comments and suggestions would be welcome.
> 
> The first thing to do is a "blktrace". I've already described how to do
> this on this list.
> 
> The interesting bit is how big your IOs are for regular data as well as
> for syncer IO. Try to do as big as possible IOs (e.g. 1 MiB) with direct
> IO for this.
> 
> If IO sizes are O.K. (>= 128 KiB), but latency between IOs is extremely
> high, then you know that something slows down the IOs.
> Otherwise, it is possible that you hit the IO limits bug in DRBD < 8.3.14:
> 
> http://git.drbd.org/gitweb.cgi?p=drbd-
> 8.3.git;a=commit;h=3a2911f0c7ab80d061a5a6878c95f6d731c98554
> 
> Repeat the blktracing connected and disconnected.
> 
> If you see 4 KiB IOs while connected, then it is an/the IO limits bug.
> 
> The dumb thing is that DRBD has a dynamic IO limits detection which
> resets the IO limits to 4 KiB if you lose connection to the secondary.
> This is why people need a caching RAID controller which merges and fixes
> such crappy behavior and can't use an HBA with SW RAID.
> 
> Also with a really fast transport like QDR InfiniBand (40 Gbit/s) you'll
> see that DRBD introduces lots of latency. Can't get more than approx.
> 250 MiB/s although the storage server can do > 600 MiB/s and the
> transport can do 4 GiB/s at < 1 us latency.
> 
> This is why we enhanced MD RAID-1 for replication. MD is much more
> stable, gives up to 512 KiB IOs, has a very intelligent write-intent
> bitmap and is high performance replication. As transport we've got SRP
> which is RDMA only. With replication separated from the transport we can
> also do better blktracing. And we have the same latency for both RAID-1
> disks as we have them both on SRP remote storage.
> 
> Cheers,
> Sebastian
> 
> --
> Sebastian Riemer
> Linux Kernel Developer - Storage
> 
> We are looking for (SENIOR) LINUX KERNEL DEVELOPERS!
> 
> ProfitBricks GmbH * Greifswalder Str. 207 * 10405 Berlin, Germany
> www.profitbricks.com * [email protected]
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
