The below stack is pretty interesting: it has really long hold times, and
accounts for the plurality of the time the mutex is held.  Maybe it's just
that this is where we are doing most of the i/o aggregation
(vdev_queue_aggregate()), and it takes some time to copy around the data?

However, I wonder if the problem could be that we are grabbing the
spa_iokstat_lock while holding the vq_lock?  e.g. from vdev_queue_io_done
-> vdev_queue_pending_remove.  You might try to remove the code that deals
with spa_iokstat* (e.g. with #if 0) and see if that improves performance
significantly.  If so, maybe this code can be reorganized so that we can
drop the vq_lock before acquiring the spa_iokstat_lock.  It looks like the
lock is nested mainly for code organization/layering purposes, rather than
for correctness.

Adaptive mutex hold: 200943 events in 120.098 seconds (1673 events/sec)

-------------------------------------------------------------------------------
....

Count indv cuml rcnt     nsec Lock                   Caller
 8874   4%  99% 0.00   130627 0xffffff19c3f97810     vdev_queue_io_done+0xa0

      nsec ------ Time Distribution ------ count     Stack
      4096 |                               108       zio_vdev_io_done+0xde
      8192 |@@@@                           1309      zio_execute+0x90
     16384 |@@@                            1134      taskq_thread+0x2d0
     32768 |@@@                            971       thread_start+0x8
     65536 |@@@                            1002
    131072 |@@@                            1180
    262144 |@@@@@                          1751
    524288 |@@@                            1107
   1048576 |@                              297
   2097152 |                               14
   4194304 |                               1



On Tue, Dec 3, 2013 at 9:49 PM, Kirill Davydychev <
[email protected]> wrote:

> Matt,
>
>
>
> Combined response to your two e-mails:
>
>
>
> Ø  The flame graph does show that.  I'm surprised that
> vdev_queue_io_done() doesn't also show up, since it is also grabbing
> vq_lock.
>
>
>
> vdev_queue_io_done() does show up, but with very few samples waiting for
> the mutex - under 2%.
>
>
>
> Ø  The lockstat data below does not show any contention in vdev_queue_io.
>  It shows contention on the spa_iokstat_lock from
> vdev_queue_io_{add,remove}() and vdev_queue_pending_remove().  There is
> only one of these locks per pool, but the hold times should be very short..
>
>
>
> I made a mistake in pasting lockstat output by only providing the top 3
> lock takers by count - not by duration. Full lockstat trace from my latest
> test is at https://gist.github.com/kdavyd/7782663. It appears that I’ve
> managed to get the locks confused. The functions you mentioned are at the
> top in terms of callers, and are indeed holding spa_iokstat_lock. For
> vdev_queue_io(), there are multiple instances taking different locks, and
> I’ve confirmed via mdb that the vq_locks in question are for the leaf vdevs
> - for instance one of the locks below:
>
>
>
> Count indv cuml rcnt     nsec Lock
> Caller
>
> 93019   2%  57% 0.00   129204 0xffffff19c3f97810     vdev_queue_io+0x74
>
>
>
> > ::spa -v ! grep ffffff19c3f972c0
>
>     ffffff19c3f972c0 HEALTHY   -
> /dev/dsk/c0t5000C50040D97E1Bd0s0
>
>
>
> > ffffff19c3f972c0::print -t -a vdev_t vdev_queue.vq_lock
>
> ffffff19c3f97810 kmutex_t vdev_queue.vq_lock = {
>
>     ffffff19c3f97810 void *[1] vdev_queue.vq_lock._opaque = [ 0 ]
>
>
>
> The locks in vdev_queue_io() take significantly longer to take, which
> explains this function standing out in the flamegraph - but is this
> expected behavior?
>
>
>
> Ø  How many CPUs?  What is the recordsize?
>
>
>
> 24 logical CPUs (dual Xeon X5660), 16K recordsize.
>
>
>
> Ø  The queues are protected by one lock per leaf vdev (vq_lock).  For
> what it's worth, this is be no different from before the new i/o scheduler.
>
>
>
> I should probably do a test on the old scheduler then, and compare the
> behavior. I understand that the queues are protected by a single lock; the
> reason for my investigation is to understand how the write I/O path can be
> improved to better utilize the hardware at hand, since in all my tests on
> fast hardware, I cannot get individual drives anywhere close to 100%
> utilization, even after disabling CPU-consuming features such as
> compression and checksums.
>
>
>
> Ø  FWIW, George is working on some changes that should reduce the number
> of entries in the vqc_queued_tree, which should reduce hold times on
> vq_lock.
>
>
>
> I am certainly open to testing anything that might change the behavior,
> for better or for worse.
>
>
>
> Ø  Could you use lockstat to gather the hold and wait times for one of
> the vdev's vq_lock?
>
>
>
> Results from “lockstat -n 1000000000 -CH -s10 -l 0xffffff19c3f97810 sleep
> 120” for the same lock mentioned above are at
> https://gist.github.com/kdavyd/7782953. If I understand this correctly,
> vdev_queue_io_done() is the one holding the lock for significantly longer
> than others; vdev_queue_io()’s critical section is much quicker in
> comparison.
>
>
>
> Best regards,
>
> Kirill Davydychev
>
>
>
> *From:* Matthew Ahrens [mailto:[email protected]]
> *Sent:* Tuesday, December 03, 2013 10:52 PM
> *To:* Kirill Davydychev
> *Cc:* George Wilson; developer
>
> *Subject:* Re: [OpenZFS Developer] Mutex contention in vdev_queue_io()
>
>
>
> On Tue, Dec 3, 2013 at 7:57 PM, Kirill Davydychev <
> [email protected]> wrote:
>
> Thanks for the suggestion George, I have tried this with the fix for 4347
> as well.
>
>
>
> It looks like it didn’t make a difference for this particular lock - in
> the one test I ran, similar behavior was observed:
>
>
>
> 12.17% of non-idle kernel time is spent in vdev_queue_io(); 78% of time in
> that function is spent waiting on vq_lock.
>
>
>
> The flame graph does show that.  I'm surprised that vdev_queue_io_done()
> doesn't also show up, since it is also grabbing vq_lock.
>
>
>
> How many CPUs?  What is the recordsize?
>
>
>
> FWIW, George is working on some changes that should reduce the number of
> entries in the vqc_queued_tree, which should reduce hold times on vq_lock..
>  Could you use lockstat to gather the hold and wait times for one of the
> vdev's vq_lock?
>
>
>
> --matt
>
>
>
>
>
>
>
> Lockstat looks same as before. I’m not sure if the mailing list allows
> attachments, but I’ve attached the resulting flamegraph. It was done during
> a regular zfs send/recv from a fast pool, with the destination system being
> traced at profile-997 for 6 minutes, then filtering out “idle”:
>
>
>
>
>
>
>
> dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); }
> tick-360s { exit(0); }' -o post-4347.out
>
>
>
>
>
> Best regards,
>
> Kirill Davydychev
>
>
>
> *From:* George Wilson [mailto:[email protected]]
> *Sent:* Monday, December 02, 2013 3:24 PM
> *To:* Kirill Davydychev; [email protected]
> *Subject:* Re: [OpenZFS Developer] Mutex contention in vdev_queue_io()
>
>
>
> The linux folks also discovered lock contention when testing this
> initially. The cause was a missing upstream push on our part so make sure
> you have the fix for '4347 ZPL can use dmu_tx_assign(TXG_WAIT)' if you are
> testing out the new write scheduler.
>
> Thanks,
> George
>
> On 11/29/13 1:02 PM, Kirill Davydychev wrote:
>
> Hi,
>
>
>
> I’m doing some performance testing of the new ZFS I/O scheduler (Illumos
> 4045 and friends) on a fairly current build of the Illumos kernel. It is
> maybe 2 weeks old, but that should not matter much.
>
>
>
> As part of my tests, I’m sending/receiving multi-TB datasets between two
> fairly fast systems, and making 10-minute flamegraphs of the kernel on the
> receive side, capturing where we spend most time during the 100% async
> write scenario.
>
>
>
> The destination pool has 3 raidz1 vdevs with 5 7.2k rpm drives each;
> source system is identical, but also has an all-SSD pool which I’m using to
> generate a higher zfs send throughput to really stress the destination. The
> write throughput is therefore around 100MB/sec in one test, and around
> 400MB/sec in the other.
>
>
>
> I am observing that in both cases, while of course there is time spent
> elsewhere in the kernel, there is a very distinct pattern of high mutex
> contention in vdev_queue_io(), to a point where this function spends
> between 73% and 82% of its sampled stacks in mutex_vector_enter(), and
> under 20% actually in vdev_queue_io_add().
>
>
>
> It gets worse with more load on the system - the 82% sample was taken
> during the 400MB/s test run... Lockstat traces also confirm the observed
> behavior:
>
>
>
> Count indv cuml rcnt     nsec Lock                   Caller
>
> 1128846  13%  13% 0.00    15035 0xffffff1181c77960
> vdev_queue_io_add+0x4d
>
>
>
>       nsec ------ Time Distribution ------ count     Stack
>
>        256 |                               17858     vdev_queue_io+0x8b
>
>        512 |@@                             87897
> zio_vdev_io_start+0x210
>
>       1024 |@@                             86224     zio_execute+0x90
>
>       2048 |@@@@@@@                        266282    zio_nowait+0x21
>
>       4096 |@@@@@@@@                       321714
> vdev_raidz_io_start+0x2ac
>
>       8192 |@@@                            127084    zio_vdev_io_start+0xb2
>
>      16384 |@                              72027     zio_execute+0x90
>
>      32768 |@                              49934     zio_nowait+0x21
>
>      65536 |                               35210
> vdev_mirror_io_start+0xcc
>
>     131072 |                               31214
>
>     262144 |                               22898
>
>     524288 |                               9376
>
>    1048576 |                               1039
>
>    2097152 |                               79
>
>    4194304 |                               8
>
>    8388608 |                               2
>
>
> -------------------------------------------------------------------------------
>
> Count indv cuml rcnt     nsec Lock                   Caller
>
> 1001623  11%  24% 0.00     9176 0xffffff1181c77960
> vdev_queue_io_remove+0x4d
>
>
>
>       nsec ------ Time Distribution ------ count     Stack
>
>        256 |                               5521
> vdev_queue_aggregate+0x2d2
>
>        512 |@                              50039
> vdev_queue_io_to_issue+0x6f
>
>       1024 |@                              50836
> vdev_queue_io_done+0x90
>
>       2048 |@@@@@@@@@                      320086    zio_vdev_io_done+0xde
>
>       4096 |@@@@@@@@@@@                    368465    zio_execute+0x90
>
>       8192 |@@                             92032     taskq_thread+0x2d0
>
>      16384 |@                              44114     thread_start+0x8
>
>      32768 |                               24243
>
>      65536 |                               16743
>
>     131072 |                               13658
>
>     262144 |                               10666
>
>     524288 |                               4649
>
>    1048576 |                               539
>
>    2097152 |                               29
>
>    4194304 |                               2
>
>    8388608 |                               1
>
>
> -------------------------------------------------------------------------------
>
> Count indv cuml rcnt     nsec Lock                   Caller
>
> 367598   4%  28% 0.00    20841 0xffffff1181c77960
> vdev_queue_pending_remove+0x54
>
>
>
>       nsec ------ Time Distribution ------ count     Stack
>
>        256 |                               7001
> vdev_queue_io_done+0x61
>
>        512 |@@                             34758     zio_vdev_io_done+0xde
>
>       1024 |@@                             33395     zio_execute+0x90
>
>       2048 |@@@@@                          67399     taskq_thread+0x2d0
>
>       4096 |@@@@@@@                        91609     thread_start+0x8
>
>       8192 |@@@                            45452
>
>      16384 |@@                             26110
>
>      32768 |@                              18888
>
>      65536 |@                              13833
>
>     131072 |@                              12772
>
>     262144 |                               10370
>
>     524288 |                               5098
>
>    1048576 |                               836
>
>    2097152 |                               67
>
>    4194304 |                               8
>
>    8388608 |                               2
>
>
>
> This appears to be due to the fact that each vdev queue is protected by a
> single global mutex, vq_lock, and in my case, the queue being hit heavily
> is the one being used by spa_sync() - asynchronous writes. The same pattern
> probably holds true for other queues as well - I have not tested this yet,
> but see no reason to believe otherwise.
>
>
>
> Has anyone else noticed this behavior, and if so, is this normal, or
> something that could potentially be improved for better performance or
> better CPU utilization?
>
>
>
> I can provide flamegraphs that I’ve generated if anyone is interested.
>
>
>
> Best regards,
>
> Kirill Davydychev
>
>
>
> _______________________________________________
>
> developer mailing list
>
> [email protected]
>
> http://lists.open-zfs.org/mailman/listinfo/developer
>
>
>
>
> _______________________________________________
> developer mailing list
> [email protected]
> http://lists.open-zfs.org/mailman/listinfo/developer
>
>
>
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to