On Thu, Nov 16, 2017 at 4:44 PM, Ivan Babrou <[email protected]> wrote:
> Hello,
>
> We are seeing high IO spikes during writeback. While writes can
> obviously slow down reads, we don't see the issue with direct IO. We
> tested different schedulers, writeback throttling and a bunch of
> different machines on 4.9 and 4.14 with both SATA and NVMe
> devices. The results are very consistent.
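>
> For reference, switching schedulers and writeback throttling was done
> via sysfs, roughly like this (a sketch: the device name and the
> latency target are just examples, and wbt_lat_usec is only present on
> kernels with blk-wbt, e.g. 4.14):
>
> $ echo deadline | sudo tee /sys/block/sdb/queue/scheduler
> $ echo 2000 | sudo tee /sys/block/sdb/queue/wbt_lat_usec  # target in usec, 0 disables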
>
> There's a bunch of numbers, but please bear with me. TL;DR: writeback
> elevates latencies a lot more than direct writes.
>
> Consider the following fio scenario:
>
> [global]
> direct=1
> runtime=30m
> group_reporting
> time_based
>
> [randread]
> rw=randread
> size=1g
> bs=4k
> numjobs=20
>
> We run this on 4.9.59 with XFS on Micron 1100 SSD with deadline
> scheduler. It's a dedicated disk with no activity besides our tests.
> With no writes present, xfsdist from bcc-tools reports the following
> latency distribution (in milliseconds):
>
> $ sudo /usr/share/bcc/tools/xfsdist -m
> Tracing XFS operation latency... Hit Ctrl-C to end.
> ^C
>
> operation = 'read'
> msecs : count distribution
> 0 -> 1 : 482298 |****************************************|
>
> All reads complete in under a millisecond here. Here's the output from fio:
>
> $ sudo fio ~/slowread.fio
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> ...
> fio-2.1.11
> Starting 20 processes
> ^Cbs: 20 (f=20): [r(20)] [3.8% done] [312.1MB/0KB/0KB /s] [80.2K/0/0
> iops] [eta 28m:52s]
> fio: terminating on signal 2
>
> randread: (groupid=0, jobs=20): err= 0: pid=11804: Wed Nov 15 01:22:00 2017
> read : io=21197MB, bw=317816KB/s, iops=79453, runt= 68297msec
> clat (usec): min=64, max=6860, avg=250.49, stdev=93.80
> lat (usec): min=64, max=6860, avg=250.56, stdev=93.80
> clat percentiles (usec):
> | 1.00th=[ 149], 5.00th=[ 159], 10.00th=[ 165], 20.00th=[ 177],
> | 30.00th=[ 189], 40.00th=[ 207], 50.00th=[ 225], 60.00th=[ 245],
> | 70.00th=[ 270], 80.00th=[ 310], 90.00th=[ 374], 95.00th=[ 438],
> | 99.00th=[ 580], 99.50th=[ 644], 99.90th=[ 764], 99.95th=[ 820],
> | 99.99th=[ 940]
> bw (KB /s): min=13096, max=16432, per=5.00%, avg=15898.70, stdev=196.77
> lat (usec) : 100=0.01%, 250=61.93%, 500=35.54%, 750=2.40%, 1000=0.12%
> lat (msec) : 2=0.01%, 10=0.01%
> cpu : usr=0.91%, sys=3.48%, ctx=5426681, majf=0, minf=142
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> issued : total=r=5426464/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> READ: io=21197MB, aggrb=317815KB/s, minb=317815KB/s,
> maxb=317815KB/s, mint=68297msec, maxt=68297msec
>
> Disk stats (read/write):
> sdb: ios=5416500/2, merge=0/0, ticks=1323879/1, in_queue=1325625,
> util=100.00%
>
> All good here.
>
> Let's measure pure write performance:
>
> $ sudo dd if=/dev/zero of=/cache2/zeros bs=1M count=2K oflag=direct
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 5.06064 s, 424 MB/s
>
> $ sudo /usr/share/bcc/tools/xfsdist -m
> Tracing XFS operation latency... Hit Ctrl-C to end.
> ^C
>
> operation = 'write'
> msecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 2042 |****************************************|
> 4 -> 7 : 0 | |
> 8 -> 15 : 6 | |
>
> operation = 'open'
> msecs : count distribution
> 0 -> 1 : 4 |****************************************|
>
>
> Baseline write latency is between 2ms and 3ms, plus a few spikes in the 8-15ms bucket.
>
> If we do buffered writes instead of direct writes:
>
> $ sudo dd if=/dev/zero of=/cache2/zeros bs=1M count=2K
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 4.70492 s, 456 MB/s
>
> $ sudo /usr/share/bcc/tools/xfsdist -m
> Tracing XFS operation latency... Hit Ctrl-C to end.
> ^C
>
> operation = 'write'
> msecs : count distribution
> 0 -> 1 : 2048 |****************************************|
>
> operation = 'open'
> msecs : count distribution
> 0 -> 1 : 4 |****************************************|
>
> Latency is lower, which is expected: buffered writes only land in the
> page cache and are flushed to disk later by writeback.
>
> Let's now mix these scenarios.
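>
> For the mixed runs below, the fio job above runs in one shell while
> the dd runs in another, started while fio is busy. Roughly:
>
> $ sudo fio ~/slowread.fio &
> $ sleep 10; sudo dd if=/dev/zero of=/cache2/zeros bs=1M count=2K oflag=direct
>
> with oflag=direct dropped for the buffered case.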
>
> * Reads and direct writes
>
> $ sudo fio ~/slowread.fio
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> ...
> fio-2.1.11
> Starting 20 processes
> ^Cbs: 20 (f=20): [r(20)] [1.2% done] [311.9MB/0KB/0KB /s] [79.9K/0/0
> iops] [eta 29m:40s]
> fio: terminating on signal 2
>
> randread: (groupid=0, jobs=20): err= 0: pid=11965: Wed Nov 15 01:25:46 2017
> read : io=5026.1MB, bw=253054KB/s, iops=63263, runt= 20342msec
> clat (usec): min=131, max=14120, avg=314.79, stdev=441.68
> lat (usec): min=131, max=14121, avg=314.87, stdev=441.75
> clat percentiles (usec):
> | 1.00th=[ 149], 5.00th=[ 157], 10.00th=[ 165], 20.00th=[ 177],
> | 30.00th=[ 191], 40.00th=[ 207], 50.00th=[ 227], 60.00th=[ 249],
> | 70.00th=[ 278], 80.00th=[ 322], 90.00th=[ 402], 95.00th=[ 502],
> | 99.00th=[ 2672], 99.50th=[ 3664], 99.90th=[ 4384], 99.95th=[ 4448],
> | 99.99th=[ 4768]
> bw (KB /s): min= 1197, max=16416, per=4.98%, avg=12608.77, stdev=5953.29
> lat (usec) : 250=60.53%, 500=34.40%, 750=2.32%, 1000=0.19%
> lat (msec) : 2=0.26%, 4=1.82%, 10=0.47%, 20=0.01%
> cpu : usr=0.65%, sys=2.96%, ctx=1286981, majf=0, minf=139
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> issued : total=r=1286905/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> READ: io=5026.1MB, aggrb=253053KB/s, minb=253053KB/s,
> maxb=253053KB/s, mint=20342msec, maxt=20342msec
>
> Disk stats (read/write):
> sdb: ios=1271838/4099, merge=0/1, ticks=392418/6967,
> in_queue=399788, util=99.60%
>
> $ sudo dd if=/dev/zero of=/cache2/zeros bs=1M count=2K oflag=direct
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 4.69243 s, 458 MB/s
>
> $ sudo /usr/share/bcc/tools/xfsdist -m
> Tracing XFS operation latency... Hit Ctrl-C to end.
> ^C
>
> operation = 'read'
> msecs : count distribution
> 0 -> 1 : 792289 |****************************************|
> 2 -> 3 : 22484 |* |
> 4 -> 7 : 5817 | |
> 8 -> 15 : 112 | |
>
> operation = 'write'
> msecs : count distribution
> 0 -> 1 : 0 | |
> 2 -> 3 : 2042 |****************************************|
> 4 -> 7 : 0 | |
> 8 -> 15 : 6 | |
>
> operation = 'open'
> msecs : count distribution
> 0 -> 1 : 1 |****************************************|
>
> Reads are definitely slower now. This is expected, because the direct
> writes take up some of the device bandwidth.
>
> * Reads and buffered writes
>
> $ sudo fio ~/slowread.fio
> randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> ...
> fio-2.1.11
> Starting 20 processes
> ^Cbs: 20 (f=20): [r(20)] [1.2% done] [309.6MB/0KB/0KB /s] [79.3K/0/0
> iops] [eta 29m:38s]
> fio: terminating on signal 2
>
> randread: (groupid=0, jobs=20): err= 0: pid=12011: Wed Nov 15 01:27:00 2017
> read : io=5344.6MB, bw=255500KB/s, iops=63874, runt= 21420msec
> clat (usec): min=88, max=129414, avg=311.99, stdev=2022.27
> lat (usec): min=88, max=129415, avg=312.06, stdev=2022.29
> clat percentiles (usec):
> | 1.00th=[ 149], 5.00th=[ 157], 10.00th=[ 165], 20.00th=[ 177],
> | 30.00th=[ 191], 40.00th=[ 207], 50.00th=[ 225], 60.00th=[ 247],
> | 70.00th=[ 274], 80.00th=[ 310], 90.00th=[ 378], 95.00th=[ 442],
> | 99.00th=[ 596], 99.50th=[ 660], 99.90th=[ 1012], 99.95th=[64256],
> | 99.99th=[75264]
> bw (KB /s): min= 40, max=16448, per=5.06%, avg=12924.17, stdev=5972.12
> lat (usec) : 100=0.01%, 250=61.56%, 500=35.78%, 750=2.44%, 1000=0.12%
> lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
> lat (msec) : 100=0.08%, 250=0.01%
> cpu : usr=0.46%, sys=3.19%, ctx=1368274, majf=0, minf=134
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> >=64=0.0%
> issued : total=r=1368200/w=0/d=0, short=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
> READ: io=5344.6MB, aggrb=255499KB/s, minb=255499KB/s,
> maxb=255499KB/s, mint=21420msec, maxt=21420msec
>
> Disk stats (read/write):
> sdb: ios=1366013/2048, merge=0/0, ticks=417412/573945,
> in_queue=993029, util=99.87%
>
> $ sudo dd if=/dev/zero of=/cache2/zeros bs=1M count=2K
> 2048+0 records in
> 2048+0 records out
> 2147483648 bytes (2.1 GB) copied, 4.83031 s, 445 MB/s
>
> $ sudo /usr/share/bcc/tools/xfsdist -m
> Tracing XFS operation latency... Hit Ctrl-C to end.
> ^C
>
> operation = 'read'
> msecs : count distribution
> 0 -> 1 : 892854 |****************************************|
> 2 -> 3 : 12 | |
> 4 -> 7 : 8 | |
> 8 -> 15 : 1 | |
> 16 -> 31 : 37 | |
> 32 -> 63 : 483 | |
> 64 -> 127 : 787 | |
> 128 -> 255 : 3 | |
>
> operation = 'write'
> msecs : count distribution
> 0 -> 1 : 2048 |****************************************|
>
> operation = 'open'
> msecs : count distribution
> 0 -> 1 : 1 |****************************************|
>
> This is a lot worse, even though we wrote the same amount of data to
> the same disk. Some reads went up into the 100ms+ range, which is not
> great for an SSD. Read IOPS in fio drop from 70000 to 700 during
> writeback. This doesn't seem right.
>
> The deadline scheduler is the best case here; with cfq we saw 4096ms+
> read latencies.
>
> Is this expected? How can we improve the situation to avoid hitting
> such high latencies?
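>
> In the meantime, one thing we are experimenting with is capping how
> much dirty data can pile up before writeback kicks in, so that each
> flush burst is smaller. A sketch (the byte values are only
> illustrative, not a recommendation):
>
> $ sudo sysctl -w vm.dirty_background_bytes=67108864  # 64 MB
> $ sudo sysctl -w vm.dirty_bytes=268435456            # 256 MB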
>
> Thanks.

I re-ran the test with an Intel S3500 SSD (XFS on LUKS, deadline
scheduler); remounting the FS with the "sync" option improved
performance.

* Buffered writes: ~80-110 reads/s during writeback flush

$ fio ~/slowread.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 20 processes
^Cbs: 20 (f=20): [r(20)] [2.0% done] [288.3MB/0KB/0KB /s] [73.8K/0/0
iops] [eta 29m:25s]
fio: terminating on signal 2
randread: (groupid=0, jobs=20): err= 0: pid=6376: Sun Nov 19 19:19:44 2017
read : io=6152.5MB, bw=177484KB/s, iops=44370, runt= 35497msec
clat (usec): min=113, max=552524, avg=449.86, stdev=6366.77
lat (usec): min=113, max=552524, avg=449.95, stdev=6366.78
clat percentiles (usec):
| 1.00th=[ 161], 5.00th=[ 183], 10.00th=[ 197], 20.00th=[ 215],
| 30.00th=[ 227], 40.00th=[ 241], 50.00th=[ 255], 60.00th=[ 270],
| 70.00th=[ 290], 80.00th=[ 318], 90.00th=[ 362], 95.00th=[ 406],
| 99.00th=[ 516], 99.50th=[ 564], 99.90th=[ 2736], 99.95th=[214016],
| 99.99th=[325632]
lat (usec) : 250=46.46%, 500=52.36%, 750=1.02%, 1000=0.03%
lat (msec) : 2=0.03%, 4=0.02%, 10=0.01%, 20=0.01%, 100=0.01%
lat (msec) : 250=0.08%, 500=0.01%, 750=0.01%
cpu : usr=0.39%, sys=1.63%, ctx=1575395, majf=0, minf=214
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1575037/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: io=6152.5MB, aggrb=177483KB/s, minb=177483KB/s,
maxb=177483KB/s, mint=35497msec, maxt=35497msec
Disk stats (read/write):
dm-12: ios=1567287/2057, merge=0/0, ticks=696784/14558300,
in_queue=15256820, util=99.82%, aggrios=1575037/2085, aggrmerge=0/1,
aggrticks=688660/1972564, aggrin_queue=2662772, aggrutil=99.81%
sdm: ios=1575037/2085, merge=0/1, ticks=688660/1972564,
in_queue=2662772, util=99.81%

$ sudo /usr/share/bcc/tools/xfsdist -m
Tracing XFS operation latency... Hit Ctrl-C to end.
^C
operation = 'read'
msecs : count distribution
0 -> 1 : 657835 |****************************************|
2 -> 3 : 105 | |
4 -> 7 : 16 | |
8 -> 15 : 4 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 638 | |
128 -> 255 : 593 | |
256 -> 511 : 219 | |
512 -> 1023 : 1 | |
operation = 'write'
msecs : count distribution
0 -> 1 : 2057 |****************************************|
operation = 'open'
msecs : count distribution
0 -> 1 : 1 |****************************************|

$ sudo dd if=/dev/zero of=zeros bs=1M count=2K
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.0991 s, 2.0 GB/s

* Remount the disk with the "sync" option

$ sudo mount -o remount,sync /state

* Writes with the "sync" mount option: ~13000 reads/s during the write

$ fio ~/slowread.fio
randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 20 processes
^Cbs: 20 (f=20): [r(20)] [2.0% done] [283.3MB/0KB/0KB /s] [72.5K/0/0
iops] [eta 29m:25s]
fio: terminating on signal 2
randread: (groupid=0, jobs=20): err= 0: pid=6956: Sun Nov 19 19:21:00 2017
read : io=5431.5MB, bw=157443KB/s, iops=39360, runt= 35326msec
clat (usec): min=113, max=9525, avg=507.17, stdev=1037.08
lat (usec): min=113, max=9525, avg=507.28, stdev=1037.09
clat percentiles (usec):
| 1.00th=[ 159], 5.00th=[ 183], 10.00th=[ 197], 20.00th=[ 215],
| 30.00th=[ 231], 40.00th=[ 247], 50.00th=[ 262], 60.00th=[ 278],
| 70.00th=[ 306], 80.00th=[ 346], 90.00th=[ 462], 95.00th=[ 1768],
| 99.00th=[ 6432], 99.50th=[ 7264], 99.90th=[ 8256], 99.95th=[ 8384],
| 99.99th=[ 8640]
lat (usec) : 250=42.71%, 500=48.22%, 750=2.00%, 1000=0.48%
lat (msec) : 2=2.05%, 4=1.79%, 10=2.75%
cpu : usr=0.39%, sys=1.77%, ctx=1390713, majf=0, minf=193
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1390460/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: io=5431.5MB, aggrb=157443KB/s, minb=157443KB/s,
maxb=157443KB/s, mint=35326msec, maxt=35326msec
Disk stats (read/write):
dm-12: ios=1377040/4218, merge=0/0, ticks=692092/18748,
in_queue=713500, util=99.85%, aggrios=1390460/8353, aggrmerge=0/1,
aggrticks=686052/20124, aggrin_queue=706764, aggrutil=99.79%
sdm: ios=1390460/8353, merge=0/1, ticks=686052/20124,
in_queue=706764, util=99.79%

$ sudo /usr/share/bcc/tools/xfsdist -m
Tracing XFS operation latency... Hit Ctrl-C to end.
^C
operation = 'read'
msecs : count distribution
0 -> 1 : 749225 |****************************************|
2 -> 3 : 24720 |* |
4 -> 7 : 35377 |* |
8 -> 15 : 2764 | |
operation = 'open'
msecs : count distribution
0 -> 1 : 263 |****************************************|
operation = 'fsync'
msecs : count distribution
0 -> 1 : 3 | |
2 -> 3 : 1 | |
4 -> 7 : 39 | |
8 -> 15 : 2014 |****************************************|

$ sudo dd if=/dev/zero of=zeros bs=1M count=2K
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 20.2343 s, 106 MB/s

Making every write synchronous (effectively an fsync per write, via the
"sync" mount option) improved read throughput during writes 162x (from
~80 to ~13000 reads/s) and cut max read latency from 552ms to 9.5ms.
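
If remounting the whole filesystem with "sync" is too heavy-handed,
roughly the same effect can probably be had per writer by making the
writes themselves synchronous, e.g. (a sketch):

$ sudo dd if=/dev/zero of=zeros bs=1M count=2K oflag=dsync

so each 1MB write has to reach the device before the next one is issued.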