Matt,


Combined response to your two e-mails:



Ø  The flame graph does show that.  I'm surprised that vdev_queue_io_done()
doesn't also show up, since it is also grabbing vq_lock.



vdev_queue_io_done() does show up, but with very few samples waiting for
the mutex - under 2%.



Ø  The lockstat data below does not show any contention in vdev_queue_io.
 It shows contention on the spa_iokstat_lock from
vdev_queue_io_{add,remove}() and vdev_queue_pending_remove().  There is
only one of these locks per pool, but the hold times should be very short.



I made a mistake in pasting lockstat output by only providing the top 3
lock takers by count - not by duration. Full lockstat trace from my latest
test is at https://gist.github.com/kdavyd/7782663. It appears that I’ve
managed to get the locks confused. The functions you mentioned are at the
top in terms of callers, and are indeed holding spa_iokstat_lock. For
vdev_queue_io(), there are multiple instances taking different locks, and
I’ve confirmed via mdb that the vq_locks in question are for the leaf vdevs
- for instance one of the locks below:



Count indv cuml rcnt     nsec Lock
Caller

93019   2%  57% 0.00   129204 0xffffff19c3f97810     vdev_queue_io+0x74



> ::spa -v ! grep ffffff19c3f972c0

    ffffff19c3f972c0 HEALTHY   -
/dev/dsk/c0t5000C50040D97E1Bd0s0



> ffffff19c3f972c0::print -t -a vdev_t vdev_queue.vq_lock

ffffff19c3f97810 kmutex_t vdev_queue.vq_lock = {

    ffffff19c3f97810 void *[1] vdev_queue.vq_lock._opaque = [ 0 ]



The locks in vdev_queue_io() take significantly longer to take, which
explains this function standing out in the flamegraph - but is this
expected behavior?



Ø  How many CPUs?  What is the recordsize?



24 logical CPUs (dual Xeon X5660), 16K recordsize.



Ø  The queues are protected by one lock per leaf vdev (vq_lock).  For what
it's worth, this is be no different from before the new i/o scheduler.



I should probably do a test on the old scheduler then, and compare the
behavior. I understand that the queues are protected by a single lock; the
reason for my investigation is to understand how the write I/O path can be
improved to better utilize the hardware at hand, since in all my tests on
fast hardware, I cannot get individual drives anywhere close to 100%
utilization, even after disabling CPU-consuming features such as
compression and checksums.



Ø  FWIW, George is working on some changes that should reduce the number of
entries in the vqc_queued_tree, which should reduce hold times on vq_lock.



I am certainly open to testing anything that might change the behavior, for
better or for worse.



Ø  Could you use lockstat to gather the hold and wait times for one of the
vdev's vq_lock?



Results from “lockstat -n 1000000000 -CH -s10 -l 0xffffff19c3f97810 sleep
120” for the same lock mentioned above are at
https://gist.github.com/kdavyd/7782953. If I understand this correctly,
vdev_queue_io_done() is the one holding the lock for significantly longer
than others; vdev_queue_io()’s critical section is much quicker in
comparison.



Best regards,

Kirill Davydychev



*From:* Matthew Ahrens [mailto:[email protected]]
*Sent:* Tuesday, December 03, 2013 10:52 PM
*To:* Kirill Davydychev
*Cc:* George Wilson; developer
*Subject:* Re: [OpenZFS Developer] Mutex contention in vdev_queue_io()



On Tue, Dec 3, 2013 at 7:57 PM, Kirill Davydychev <
[email protected]> wrote:

Thanks for the suggestion George, I have tried this with the fix for 4347
as well.



It looks like it didn’t make a difference for this particular lock - in the
one test I ran, similar behavior was observed:



12.17% of non-idle kernel time is spent in vdev_queue_io(); 78% of time in
that function is spent waiting on vq_lock.



The flame graph does show that.  I'm surprised that vdev_queue_io_done()
doesn't also show up, since it is also grabbing vq_lock.



How many CPUs?  What is the recordsize?



FWIW, George is working on some changes that should reduce the number of
entries in the vqc_queued_tree, which should reduce hold times on vq_lock.
 Could you use lockstat to gather the hold and wait times for one of the
vdev's vq_lock?



--matt







Lockstat looks same as before. I’m not sure if the mailing list allows
attachments, but I’ve attached the resulting flamegraph. It was done during
a regular zfs send/recv from a fast pool, with the destination system being
traced at profile-997 for 6 minutes, then filtering out “idle”:







dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); }
tick-360s { exit(0); }' -o post-4347.out





Best regards,

Kirill Davydychev



*From:* George Wilson [mailto:[email protected]]
*Sent:* Monday, December 02, 2013 3:24 PM
*To:* Kirill Davydychev; [email protected]
*Subject:* Re: [OpenZFS Developer] Mutex contention in vdev_queue_io()



The linux folks also discovered lock contention when testing this
initially. The cause was a missing upstream push on our part so make sure
you have the fix for '4347 ZPL can use dmu_tx_assign(TXG_WAIT)' if you are
testing out the new write scheduler.

Thanks,
George

On 11/29/13 1:02 PM, Kirill Davydychev wrote:

Hi,



I’m doing some performance testing of the new ZFS I/O scheduler (Illumos
4045 and friends) on a fairly current build of the Illumos kernel. It is
maybe 2 weeks old, but that should not matter much.



As part of my tests, I’m sending/receiving multi-TB datasets between two
fairly fast systems, and making 10-minute flamegraphs of the kernel on the
receive side, capturing where we spend most time during the 100% async
write scenario.



The destination pool has 3 raidz1 vdevs with 5 7.2k rpm drives each; source
system is identical, but also has an all-SSD pool which I’m using to
generate a higher zfs send throughput to really stress the destination. The
write throughput is therefore around 100MB/sec in one test, and around
400MB/sec in the other.



I am observing that in both cases, while of course there is time spent
elsewhere in the kernel, there is a very distinct pattern of high mutex
contention in vdev_queue_io(), to a point where this function spends
between 73% and 82% of its sampled stacks in mutex_vector_enter(), and
under 20% actually in vdev_queue_io_add().



It gets worse with more load on the system - the 82% sample was taken
during the 400MB/s test run... Lockstat traces also confirm the observed
behavior:



Count indv cuml rcnt     nsec Lock                   Caller

1128846  13%  13% 0.00    15035 0xffffff1181c77960
vdev_queue_io_add+0x4d



      nsec ------ Time Distribution ------ count     Stack

       256 |                               17858     vdev_queue_io+0x8b

       512 |@@                             87897     zio_vdev_io_start+0x210

      1024 |@@                             86224     zio_execute+0x90

      2048 |@@@@@@@                        266282    zio_nowait+0x21

      4096 |@@@@@@@@                       321714
vdev_raidz_io_start+0x2ac

      8192 |@@@                            127084    zio_vdev_io_start+0xb2

     16384 |@                              72027     zio_execute+0x90

     32768 |@                              49934     zio_nowait+0x21

     65536 |                               35210
vdev_mirror_io_start+0xcc

    131072 |                               31214

    262144 |                               22898

    524288 |                               9376

   1048576 |                               1039

   2097152 |                               79

   4194304 |                               8

   8388608 |                               2

-------------------------------------------------------------------------------

Count indv cuml rcnt     nsec Lock                   Caller

1001623  11%  24% 0.00     9176 0xffffff1181c77960
vdev_queue_io_remove+0x4d



      nsec ------ Time Distribution ------ count     Stack

       256 |                               5521
vdev_queue_aggregate+0x2d2

       512 |@                              50039
vdev_queue_io_to_issue+0x6f

      1024 |@                              50836     vdev_queue_io_done+0x90

      2048 |@@@@@@@@@                      320086    zio_vdev_io_done+0xde

      4096 |@@@@@@@@@@@                    368465    zio_execute+0x90

      8192 |@@                             92032     taskq_thread+0x2d0

     16384 |@                              44114     thread_start+0x8

     32768 |                               24243

     65536 |                               16743

    131072 |                               13658

    262144 |                               10666

    524288 |                               4649

   1048576 |                               539

   2097152 |                               29

   4194304 |                               2

   8388608 |                               1

-------------------------------------------------------------------------------

Count indv cuml rcnt     nsec Lock                   Caller

367598   4%  28% 0.00    20841 0xffffff1181c77960
vdev_queue_pending_remove+0x54



      nsec ------ Time Distribution ------ count     Stack

       256 |                               7001      vdev_queue_io_done+0x61

       512 |@@                             34758     zio_vdev_io_done+0xde

      1024 |@@                             33395     zio_execute+0x90

      2048 |@@@@@                          67399     taskq_thread+0x2d0

      4096 |@@@@@@@                        91609     thread_start+0x8

      8192 |@@@                            45452

     16384 |@@                             26110

     32768 |@                              18888

     65536 |@                              13833

    131072 |@                              12772

    262144 |                               10370

    524288 |                               5098

   1048576 |                               836

   2097152 |                               67

   4194304 |                               8

   8388608 |                               2



This appears to be due to the fact that each vdev queue is protected by a
single global mutex, vq_lock, and in my case, the queue being hit heavily
is the one being used by spa_sync() - asynchronous writes. The same pattern
probably holds true for other queues as well - I have not tested this yet,
but see no reason to believe otherwise.



Has anyone else noticed this behavior, and if so, is this normal, or
something that could potentially be improved for better performance or
better CPU utilization?



I can provide flamegraphs that I’ve generated if anyone is interested.



Best regards,

Kirill Davydychev



_______________________________________________

developer mailing list

[email protected]

http://lists.open-zfs.org/mailman/listinfo/developer




_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to