Thanks for the suggestion George, I have tried this with the fix for 4347 as well.
It looks like it didn't make a difference for this particular lock - in the one test I ran, I observed similar behavior: 12.17% of non-idle kernel time is spent in vdev_queue_io(), and 78% of the time in that function is spent waiting on vq_lock. The lockstat output looks the same as before. I tried attaching the resulting flamegraph, but it looks like the list rejects large e-mails. The trace was taken during a regular zfs send/recv from a fast pool, with the destination system sampled at profile-997 for 6 minutes and "idle" stacks filtered out.

In retrospect, it makes sense that Illumos #4347 did not make a difference - the dmu_tx_assign() in restore_write() was already using TXG_WAIT.

Best regards,
Kirill Davydychev

*From:* George Wilson [mailto:[email protected]]
*Sent:* Monday, December 02, 2013 3:24 PM
*To:* Kirill Davydychev; [email protected]
*Subject:* Re: [OpenZFS Developer] Mutex contention in vdev_queue_io()

The linux folks also discovered lock contention when testing this initially. The cause was a missing upstream push on our part, so make sure you have the fix for '4347 ZPL can use dmu_tx_assign(TXG_WAIT)' if you are testing out the new write scheduler.

Thanks,
George

On 11/29/13 1:02 PM, Kirill Davydychev wrote:

Hi,

I'm doing some performance testing of the new ZFS I/O scheduler (Illumos 4045 and friends) on a fairly current build of the Illumos kernel. The build is maybe two weeks old, but that should not matter much. As part of my tests, I'm sending/receiving multi-TB datasets between two fairly fast systems and making 10-minute flamegraphs of the kernel on the receive side, capturing where we spend most of our time during the 100% async write scenario. The destination pool has 3 raidz1 vdevs with 5 7.2k rpm drives each; the source system is identical, but also has an all-SSD pool which I'm using to generate higher zfs send throughput to really stress the destination. The write throughput is therefore around 100MB/sec in one test and around 400MB/sec in the other.
I am observing that in both cases, while there is of course time spent elsewhere in the kernel, there is a very distinct pattern of high mutex contention in vdev_queue_io(): between 73% and 82% of this function's sampled stacks are in mutex_vector_enter(), and under 20% are actually in vdev_queue_io_add(). It gets worse with more load on the system - the 82% sample was taken during the 400MB/s test run. Lockstat traces also confirm the observed behavior:

    Count indv cuml rcnt     nsec Lock                Caller
  1128846  13%  13% 0.00    15035 0xffffff1181c77960  vdev_queue_io_add+0x4d

      nsec ------ Time Distribution ------ count     Stack
       256 |                               17858     vdev_queue_io+0x8b
       512 |@@                             87897     zio_vdev_io_start+0x210
      1024 |@@                             86224     zio_execute+0x90
      2048 |@@@@@@@                        266282    zio_nowait+0x21
      4096 |@@@@@@@@                       321714    vdev_raidz_io_start+0x2ac
      8192 |@@@                            127084    zio_vdev_io_start+0xb2
     16384 |@                              72027     zio_execute+0x90
     32768 |@                              49934     zio_nowait+0x21
     65536 |                               35210     vdev_mirror_io_start+0xcc
    131072 |                               31214
    262144 |                               22898
    524288 |                               9376
   1048576 |                               1039
   2097152 |                               79
   4194304 |                               8
   8388608 |                               2
  -------------------------------------------------------------------------------

    Count indv cuml rcnt     nsec Lock                Caller
  1001623  11%  24% 0.00     9176 0xffffff1181c77960  vdev_queue_io_remove+0x4d

      nsec ------ Time Distribution ------ count     Stack
       256 |                               5521      vdev_queue_aggregate+0x2d2
       512 |@                              50039     vdev_queue_io_to_issue+0x6f
      1024 |@                              50836     vdev_queue_io_done+0x90
      2048 |@@@@@@@@@                      320086    zio_vdev_io_done+0xde
      4096 |@@@@@@@@@@@                    368465    zio_execute+0x90
      8192 |@@                             92032     taskq_thread+0x2d0
     16384 |@                              44114     thread_start+0x8
     32768 |                               24243
     65536 |                               16743
    131072 |                               13658
    262144 |                               10666
    524288 |                               4649
   1048576 |                               539
   2097152 |                               29
   4194304 |                               2
   8388608 |                               1
  -------------------------------------------------------------------------------

    Count indv cuml rcnt     nsec Lock                Caller
   367598   4%  28% 0.00    20841 0xffffff1181c77960  vdev_queue_pending_remove+0x54

      nsec ------ Time Distribution ------ count     Stack
       256 |                               7001      vdev_queue_io_done+0x61
       512 |@@                             34758     zio_vdev_io_done+0xde
      1024 |@@                             33395     zio_execute+0x90
      2048 |@@@@@                          67399     taskq_thread+0x2d0
      4096 |@@@@@@@                        91609     thread_start+0x8
      8192 |@@@                            45452
     16384 |@@                             26110
     32768 |@                              18888
     65536 |@                              13833
    131072 |@                              12772
    262144 |                               10370
    524288 |                               5098
   1048576 |                               836
   2097152 |                               67
   4194304 |                               8
   8388608 |                               2

This appears to be due to the fact that each vdev queue is protected by a single mutex, vq_lock, and in my case, the queue being hit heavily is the one used by spa_sync() - asynchronous writes. The same pattern probably holds true for the other queues as well - I have not tested this yet, but see no reason to believe otherwise.

Has anyone else noticed this behavior, and if so, is this normal, or something that could potentially be improved for better performance or CPU utilization? I can provide the flamegraphs I've generated if anyone is interested.

Best regards,
Kirill Davydychev

_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
