Hi,
I’m doing some performance testing of the new ZFS I/O scheduler (Illumos
4045 and friends) on a fairly current build of the Illumos kernel. It is
maybe 2 weeks old, but that should not matter much.
As part of my tests, I’m sending/receiving multi-TB datasets between two
fairly fast systems, and making 10-minute flamegraphs of the kernel on the
receive side, capturing where we spend most time during the 100% async
write scenario.
The destination pool has 3 raidz1 vdevs with 5 7.2k rpm drives each; source
system is identical, but also has an all-SSD pool which I’m using to
generate a higher zfs send throughput to really stress the destination. The
write throughput is therefore around 100MB/sec in one test, and around
400MB/sec in the other.
I am observing that in both cases, while of course there is time spent
elsewhere in the kernel, there is a very distinct pattern of high mutex
contention in vdev_queue_io(), to a point where this function spends
between 73% and 82% of its sampled stacks in mutex_vector_enter(), and
under 20% actually in vdev_queue_io_add().
It gets worse with more load on the system - the 82% sample was taken
during the 400MB/s test run. Lockstat traces also confirm the observed
behavior:
Count indv cuml rcnt nsec Lock Caller
1128846 13% 13% 0.00 15035 0xffffff1181c77960
vdev_queue_io_add+0x4d
nsec ------ Time Distribution ------ count Stack
256 | 17858 vdev_queue_io+0x8b
512 |@@ 87897 zio_vdev_io_start+0x210
1024 |@@ 86224 zio_execute+0x90
2048 |@@@@@@@ 266282 zio_nowait+0x21
4096 |@@@@@@@@ 321714
vdev_raidz_io_start+0x2ac
8192 |@@@ 127084 zio_vdev_io_start+0xb2
16384 |@ 72027 zio_execute+0x90
32768 |@ 49934 zio_nowait+0x21
65536 | 35210
vdev_mirror_io_start+0xcc
131072 | 31214
262144 | 22898
524288 | 9376
1048576 | 1039
2097152 | 79
4194304 | 8
8388608 | 2
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Caller
1001623 11% 24% 0.00 9176 0xffffff1181c77960
vdev_queue_io_remove+0x4d
nsec ------ Time Distribution ------ count Stack
256 | 5521
vdev_queue_aggregate+0x2d2
512 |@ 50039
vdev_queue_io_to_issue+0x6f
1024 |@ 50836 vdev_queue_io_done+0x90
2048 |@@@@@@@@@ 320086 zio_vdev_io_done+0xde
4096 |@@@@@@@@@@@ 368465 zio_execute+0x90
8192 |@@ 92032 taskq_thread+0x2d0
16384 |@ 44114 thread_start+0x8
32768 | 24243
65536 | 16743
131072 | 13658
262144 | 10666
524288 | 4649
1048576 | 539
2097152 | 29
4194304 | 2
8388608 | 1
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Caller
367598 4% 28% 0.00 20841 0xffffff1181c77960
vdev_queue_pending_remove+0x54
nsec ------ Time Distribution ------ count Stack
256 | 7001 vdev_queue_io_done+0x61
512 |@@ 34758 zio_vdev_io_done+0xde
1024 |@@ 33395 zio_execute+0x90
2048 |@@@@@ 67399 taskq_thread+0x2d0
4096 |@@@@@@@ 91609 thread_start+0x8
8192 |@@@ 45452
16384 |@@ 26110
32768 |@ 18888
65536 |@ 13833
131072 |@ 12772
262144 | 10370
524288 | 5098
1048576 | 836
2097152 | 67
4194304 | 8
8388608 | 2
This appears to be due to the fact that each vdev queue is protected by a
single global mutex, vq_lock, and in my case, the queue being hit heavily
is the one being used by spa_sync() - asynchronous writes. The same pattern
probably holds true for other queues as well - I have not tested this yet,
but see no reason to believe otherwise.
Has anyone else noticed this behavior, and if so, is this normal, or
something that could potentially be improved for better performance or
better CPU utilization?
I can provide flamegraphs that I’ve generated if anyone is interested.
Best regards,
Kirill Davydychev
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer