On Tue, Dec 3, 2013 at 7:57 PM, Kirill Davydychev <
[email protected]> wrote:
> Thanks for the suggestion George, I have tried this with the fix for 4347
> as well.
>
>
>
> It looks like it didn’t make a difference for this particular lock - in
> the one test I ran, similar behavior was observed:
>
>
>
> 12.17% of non-idle kernel time is spent in vdev_queue_io(); 78% of time in
> that function is spent waiting on vq_lock.
>
The flame graph does show that. I'm surprised that vdev_queue_io_done()
doesn't also show up, since it is also grabbing vq_lock.
How many CPUs? What is the recordsize?
FWIW, George is working on some changes that should reduce the number of
entries in the vqc_queued_tree, which should reduce hold times on vq_lock.
Could you use lockstat to gather the hold and wait times for one of the
vdev's vq_lock?
--matt
>
>
> Lockstat looks same as before. I’m not sure if the mailing list allows
> attachments, but I’ve attached the resulting flamegraph. It was done during
> a regular zfs send/recv from a fast pool, with the destination system being
> traced at profile-997 for 6 minutes, then filtering out “idle”:
>
>
>
> dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); }
> tick-360s { exit(0); }' -o post-4347.out
>
>
>
>
>
> Best regards,
>
> Kirill Davydychev
>
>
>
> *From:* George Wilson [mailto:[email protected]]
> *Sent:* Monday, December 02, 2013 3:24 PM
> *To:* Kirill Davydychev; [email protected]
> *Subject:* Re: [OpenZFS Developer] Mutex contention in vdev_queue_io()
>
>
>
> The linux folks also discovered lock contention when testing this
> initially. The cause was a missing upstream push on our part so make sure
> you have the fix for '4347 ZPL can use dmu_tx_assign(TXG_WAIT)' if you are
> testing out the new write scheduler.
>
> Thanks,
> George
>
> On 11/29/13 1:02 PM, Kirill Davydychev wrote:
>
> Hi,
>
>
>
> I’m doing some performance testing of the new ZFS I/O scheduler (Illumos
> 4045 and friends) on a fairly current build of the Illumos kernel. It is
> maybe 2 weeks old, but that should not matter much.
>
>
>
> As part of my tests, I’m sending/receiving multi-TB datasets between two
> fairly fast systems, and making 10-minute flamegraphs of the kernel on the
> receive side, capturing where we spend most time during the 100% async
> write scenario.
>
>
>
> The destination pool has 3 raidz1 vdevs with 5 7.2k rpm drives each;
> source system is identical, but also has an all-SSD pool which I’m using to
> generate a higher zfs send throughput to really stress the destination. The
> write throughput is therefore around 100MB/sec in one test, and around
> 400MB/sec in the other.
>
>
>
> I am observing that in both cases, while of course there is time spent
> elsewhere in the kernel, there is a very distinct pattern of high mutex
> contention in vdev_queue_io(), to a point where this function spends
> between 73% and 82% of its sampled stacks in mutex_vector_enter(), and
> under 20% actually in vdev_queue_io_add().
>
>
>
> It gets worse with more load on the system - the 82% sample was taken
> during the 400MB/s test run... Lockstat traces also confirm the observed
> behavior:
>
>
>
> Count indv cuml rcnt nsec Lock Caller
>
> 1128846 13% 13% 0.00 15035 0xffffff1181c77960
> vdev_queue_io_add+0x4d
>
>
>
> nsec ------ Time Distribution ------ count Stack
>
> 256 | 17858 vdev_queue_io+0x8b
>
> 512 |@@ 87897
> zio_vdev_io_start+0x210
>
> 1024 |@@ 86224 zio_execute+0x90
>
> 2048 |@@@@@@@ 266282 zio_nowait+0x21
>
> 4096 |@@@@@@@@ 321714
> vdev_raidz_io_start+0x2ac
>
> 8192 |@@@ 127084 zio_vdev_io_start+0xb2
>
> 16384 |@ 72027 zio_execute+0x90
>
> 32768 |@ 49934 zio_nowait+0x21
>
> 65536 | 35210
> vdev_mirror_io_start+0xcc
>
> 131072 | 31214
>
> 262144 | 22898
>
> 524288 | 9376
>
> 1048576 | 1039
>
> 2097152 | 79
>
> 4194304 | 8
>
> 8388608 | 2
>
>
> -------------------------------------------------------------------------------
>
> Count indv cuml rcnt nsec Lock Caller
>
> 1001623 11% 24% 0.00 9176 0xffffff1181c77960
> vdev_queue_io_remove+0x4d
>
>
>
> nsec ------ Time Distribution ------ count Stack
>
> 256 | 5521
> vdev_queue_aggregate+0x2d2
>
> 512 |@ 50039
> vdev_queue_io_to_issue+0x6f
>
> 1024 |@ 50836
> vdev_queue_io_done+0x90
>
> 2048 |@@@@@@@@@ 320086 zio_vdev_io_done+0xde
>
> 4096 |@@@@@@@@@@@ 368465 zio_execute+0x90
>
> 8192 |@@ 92032 taskq_thread+0x2d0
>
> 16384 |@ 44114 thread_start+0x8
>
> 32768 | 24243
>
> 65536 | 16743
>
> 131072 | 13658
>
> 262144 | 10666
>
> 524288 | 4649
>
> 1048576 | 539
>
> 2097152 | 29
>
> 4194304 | 2
>
> 8388608 | 1
>
>
> -------------------------------------------------------------------------------
>
> Count indv cuml rcnt nsec Lock Caller
>
> 367598 4% 28% 0.00 20841 0xffffff1181c77960
> vdev_queue_pending_remove+0x54
>
>
>
> nsec ------ Time Distribution ------ count Stack
>
> 256 | 7001
> vdev_queue_io_done+0x61
>
> 512 |@@ 34758 zio_vdev_io_done+0xde
>
> 1024 |@@ 33395 zio_execute+0x90
>
> 2048 |@@@@@ 67399 taskq_thread+0x2d0
>
> 4096 |@@@@@@@ 91609 thread_start+0x8
>
> 8192 |@@@ 45452
>
> 16384 |@@ 26110
>
> 32768 |@ 18888
>
> 65536 |@ 13833
>
> 131072 |@ 12772
>
> 262144 | 10370
>
> 524288 | 5098
>
> 1048576 | 836
>
> 2097152 | 67
>
> 4194304 | 8
>
> 8388608 | 2
>
>
>
> This appears to be due to the fact that each vdev queue is protected by a
> single global mutex, vq_lock, and in my case, the queue being hit heavily
> is the one being used by spa_sync() - asynchronous writes. The same pattern
> probably holds true for other queues as well - I have not tested this yet,
> but see no reason to believe otherwise.
>
>
>
> Has anyone else noticed this behavior, and if so, is this normal, or
> something that could potentially be improved for better performance or
> better CPU utilization?
>
>
>
> I can provide flamegraphs that I’ve generated if anyone is interested.
>
>
>
> Best regards,
>
> Kirill Davydychev
>
>
>
>
> _______________________________________________
>
> developer mailing list
>
> [email protected]
>
> http://lists.open-zfs.org/mailman/listinfo/developer
>
>
>
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer