Hi.

On 30.11.2013 00:35, Kirill Davydychev wrote:
I’m doing some performance testing of the new ZFS I/O scheduler (Illumos
4045 and friends) on a fairly current build of the Illumos kernel. It is
maybe 2 weeks old, but that should not matter much.

As part of my tests, I’m sending/receiving multi-TB datasets between two
fairly fast systems, and making 10-minute flamegraphs of the kernel on
the receive side, capturing where we spend most time during the 100%
async write scenario.

The destination pool has 3 raidz1 vdevs with 5 7.2k rpm drives each;
source system is identical, but also has an all-SSD pool which I’m using
to generate a higher zfs send throughput to really stress the
destination. The write throughput is therefore around 100MB/sec in one
test, and around 400MB/sec in the other.

I am observing that in both cases, while of course there is time spent
elsewhere in the kernel, there is a very distinct pattern of high mutex
contention in vdev_queue_io(), to a point where this function spends
between 73% and 82% of its sampled stacks in mutex_vector_enter(), and
under 20% actually in vdev_queue_io_add().

It gets worse with more load on the system - the 82% sample was taken
during the 400MB/s test run.. Lockstat traces also confirm the observed
behavior:

Count indv cuml rcnt     nsec Lock                   Caller

1128846  13%  13% 0.00    15035 0xffffff1181c77960
vdev_queue_io_add+0x4d

       nsec ------ Time Distribution ------ count     Stack

        256 |                               17858     vdev_queue_io+0x8b

        512 |@@                             87897
zio_vdev_io_start+0x210

       1024 |@@                             86224     zio_execute+0x90

       2048 |@@@@@@@                        266282    zio_nowait+0x21

       4096 |@@@@@@@@                       321714
vdev_raidz_io_start+0x2ac

       8192 |@@@                            127084    zio_vdev_io_start+0xb2

      16384 |@                              72027     zio_execute+0x90

      32768 |@                              49934     zio_nowait+0x21

      65536 |                               35210
vdev_mirror_io_start+0xcc

     131072 |                               31214

     262144 |                               22898

     524288 |                               9376

    1048576 |                               1039

    2097152 |                               79

    4194304 |                               8

    8388608 |                               2

-------------------------------------------------------------------------------

Count indv cuml rcnt     nsec Lock                   Caller

1001623  11%  24% 0.00     9176 0xffffff1181c77960
vdev_queue_io_remove+0x4d

       nsec ------ Time Distribution ------ count     Stack

        256 |                               5521
vdev_queue_aggregate+0x2d2

        512 |@                              50039
vdev_queue_io_to_issue+0x6f

       1024 |@                              50836
vdev_queue_io_done+0x90

       2048 |@@@@@@@@@                      320086    zio_vdev_io_done+0xde

       4096 |@@@@@@@@@@@                    368465    zio_execute+0x90

       8192 |@@                             92032     taskq_thread+0x2d0

      16384 |@                              44114     thread_start+0x8

      32768 |                               24243

      65536 |                               16743

     131072 |                               13658

     262144 |                               10666

     524288 |                               4649

    1048576 |                               539

    2097152 |                               29

    4194304 |                               2

    8388608 |                               1

-------------------------------------------------------------------------------

Count indv cuml rcnt     nsec Lock                   Caller

367598   4%  28% 0.00    20841 0xffffff1181c77960
vdev_queue_pending_remove+0x54

       nsec ------ Time Distribution ------ count     Stack

        256 |                               7001
vdev_queue_io_done+0x61

        512 |@@                             34758     zio_vdev_io_done+0xde

       1024 |@@                             33395     zio_execute+0x90

       2048 |@@@@@                          67399     taskq_thread+0x2d0

       4096 |@@@@@@@                        91609     thread_start+0x8

       8192 |@@@                            45452

      16384 |@@                             26110

      32768 |@                              18888

      65536 |@                              13833

     131072 |@                              12772

     262144 |                               10370

     524288 |                               5098

    1048576 |                               836

    2097152 |                               67

    4194304 |                               8

    8388608 |                               2

This appears to be due to the fact that each vdev queue is protected by
a single global mutex, vq_lock, and in my case, the queue being hit
heavily is the one being used by spa_sync() - asynchronous writes. The
same pattern probably holds true for other queues as well - I have not
tested this yet, but see no reason to believe otherwise.

Has anyone else noticed this behavior, and if so, is this normal, or
something that could potentially be improved for better performance or
better CPU utilization?

I can provide flamegraphs that I’ve generated if anyone is interested.

I've also noticed strong vdev queue lock congestion with that code on FreeBSD, which I don't remember in my previous tests:

14.15%  [201]      _sx_xlock_hard @ /boot/kernel/kernel
 84.08%  [169]       vdev_queue_io
  100.0%  [169]        zio_vdev_io_start
   100.0%  [169]         zio_execute
    100.0%  [169]          vdev_mirror_io_start
     100.0%  [169]           zio_vdev_io_start
      100.0%  [169]            zio_execute
       100.0%  [169]             taskqueue_run_locked


I don't know all the locking logic there to comment, just noticed that these four multiplications done for every I/O don't look very needed:

static int
vdev_queue_max_async_writes(uint64_t dirty)
{
        int writes;
        uint64_t min_bytes = zfs_dirty_data_max *
            zfs_vdev_async_write_active_min_dirty_percent / 100;
        uint64_t max_bytes = zfs_dirty_data_max *
            zfs_vdev_async_write_active_max_dirty_percent / 100;

--
Alexander Motin
_______________________________________________
developer mailing list
[email protected]
http://lists.open-zfs.org/mailman/listinfo/developer

Reply via email to