Reviewed by: Matthew Ahrens mahr...@delphix.com
Reviewed by: Pavel Zakharov pavel.zakha...@delphix.com
Reviewed by: Brad Lewis <brad.le...@delphix.com>
Reviewed by: George Wilson <george.wil...@delphix.com>
Reviewed by: Paul Dagnelie <p...@delphix.com>
Reviewed by: Prashanth Sreenivasa <p...@delphix.com>

Overview

In analyzing the time it takes for a Delphix Engine to come up following
a planned or unplanned reboot, we've determined that the SMF service
(filesystem/local) that mounts all local filesystems (except for /)
accounts for a significant percentage of the boot time. The longer the
Delphix Engine takes to come up, the longer it is unavailable during
these outages. For example, on a Delphix Engine with roughly 3000
filesystems, we have the following breakdown of "filesystem/local" start
times for a sample of 74 reboots:

    # NumSamples = 74; Min = 0.00; Max = 782.00
    # Mean = 186.972973; Variance = 17853.891161; SD = 133.618454; Median 156.000000
    # each * represents a count of 1
        0.0000 -    78.2000 [    10]: **********
       78.2000 -   156.4000 [    27]: ***************************
      156.4000 -   234.6000 [    17]: *****************
      234.6000 -   312.8000 [     8]: ********
      312.8000 -   391.0000 [     8]: ********
      391.0000 -   469.2000 [     1]: *
      469.2000 -   547.4000 [     1]: *
      547.4000 -   625.6000 [     1]: *
      625.6000 -   703.8000 [     0]:
      703.8000 -   782.0000 [     1]: *

On average, it takes over 3 minutes to mount local filesystems on that
system. A sample of 56 reboots on another system with more than 9000
filesystems is below:

    # NumSamples = 56; Min = 0.00; Max = 1377.00
    # Mean = 175.250000; Variance = 54092.223214; SD = 232.577349; Median 118.000000
    # each * represents a count of 1
        0.0000 -   137.7000 [    37]: *************************************
      137.7000 -   275.4000 [    11]: ***********
      275.4000 -   413.1000 [     4]: ****
      413.1000 -   550.8000 [     1]: *
      550.8000 -   688.5000 [     1]: *
      688.5000 -   826.2000 [     0]:
      826.2000 -   963.9000 [     0]:
      963.9000 -  1101.6000 [     1]: *
     1101.6000 -  1239.3000 [     0]:
     1239.3000 -  1377.0000 [     1]: *

Mounting of filesystems in "filesystem/local" is done using `zfs mount -a`,
which mounts each filesystem serially. The bottleneck for each mount is
the I/O done to load metadata for that filesystem. As such, mounting
filesystems in parallel should be a big win and should bring down the
runtime of "filesystem/local"'s start method.
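
As a rough illustration of the approach (this is a simplified sketch,
not the actual libzfs implementation; do_mount() below is a hypothetical
stand-in for the real per-filesystem mount work), the idea is to hand a
set of independent mountpoints to a pool of worker threads, mounting
parents before their children:

    /*
     * Simplified sketch of level-by-level parallel mounting.  Entries
     * within a level have no parent/child relationship to each other,
     * so they can be mounted concurrently.
     */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS  8

    typedef struct mount_work {
        const char **paths;     /* mountpoints at one hierarchy depth */
        int count;              /* number of entries */
        int next;               /* next unclaimed index */
        pthread_mutex_t lock;
    } mount_work_t;

    /* Hypothetical stand-in for the real mount step. */
    static void
    do_mount(const char *mountpoint)
    {
        (void) printf("mounting %s\n", mountpoint);
    }

    static void *
    worker(void *arg)
    {
        mount_work_t *w = arg;

        for (;;) {
            (void) pthread_mutex_lock(&w->lock);
            int i = w->next++;
            (void) pthread_mutex_unlock(&w->lock);
            if (i >= w->count)
                break;
            do_mount(w->paths[i]);
        }
        return (NULL);
    }

    /* Mount one level of the hierarchy using NWORKERS threads. */
    static void
    mount_level(const char **paths, int count)
    {
        mount_work_t w;
        pthread_t tid[NWORKERS];

        w.paths = paths;
        w.count = count;
        w.next = 0;
        (void) pthread_mutex_init(&w.lock, NULL);

        for (int t = 0; t < NWORKERS; t++)
            (void) pthread_create(&tid[t], NULL, worker, &w);
        for (int t = 0; t < NWORKERS; t++)
            (void) pthread_join(&tid[t], NULL);

        (void) pthread_mutex_destroy(&w.lock);
    }

    int
    main(void)
    {
        /* Toy example: parents first, then their children. */
        const char *level1[] = { "/test-pool/group-0", "/test-pool/group-1" };
        const char *level2[] = { "/test-pool/group-0/container-0",
            "/test-pool/group-1/container-0" };

        mount_level(level1, 2);
        mount_level(level2, 2);
        return (0);
    }

The real change applies the same principle inside `zfs mount -a`, so the
per-filesystem metadata I/O can be issued concurrently instead of one
filesystem at a time.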

Performance Testing: System Configuration

To verify that these changes improved performance as expected, we used
a VM with:

  - 8 vCPUs
  - a zpool with ten 10k SAS disks
  - a filesystem hierarchy laid out like so:

        1 pool     2 groups  100 containers  2 timeflows    5 leaf datasets
                               per group     per container  per timeflow
        test-pool-+-group-0-+-container-0-+---timeflow-0---+-ds-0
                  |         |             |                +-ds-1
                  |         |             |                +-ds-2
                  |         |             |                +-ds-3
                  |         |             |                +-ds-4
                  |         |             |
                  |         |             +---timeflow-1---+-ds-0
                  |         |                              +-ds-1
                  |         |                              +-ds-2
                  |         |                              +-ds-3
                  |         |                              +-ds-4
                  |         |
                  |         +-container-1-+---timeflow-0---+-ds-0
                  |         |             |                +-ds-1
                  |         |             |                +-ds-2
                  |         |             |                +-ds-3
                  |         |             |                +-ds-4
                  |         |             |
                  |         |             +---timeflow-1---+-ds-0
                  |         |                              +-ds-1
                  |         |                              +-ds-2
                  |         |                              +-ds-3
                  |         |                              +-ds-4
                  |         + ...
                  |         .
                  |         .
                  |
                  +-group-1 ...

This makes for a total of 2603 filesystems:

    pool + groups + containers + timeflows + leaves
    1    + 2      + 2*100      + 2*(2*100) + 5*(2*(2*100))
    1    + 2      + 200        + 400       + 2000           = 2603 filesystems

Additionally, a 1MB file was created in each leaf dataset.

Because this filesystem hierarchy is not very deep, it lends itself
well to the newly implemented parallel mounting algorithm.
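
For reference, a hierarchy like the one above can be generated with a
small program along these lines (illustrative only; the actual test
setup tooling is not shown here, and this sketch simply shells out to
`zfs create`):

    /*
     * Create 2 groups x 100 containers x 2 timeflows x 5 leaf datasets
     * under "test-pool", matching the diagram above.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define POOL "test-pool"

    static void
    create_fs(const char *name)
    {
        char cmd[512];

        (void) snprintf(cmd, sizeof (cmd), "zfs create %s", name);
        if (system(cmd) != 0)
            (void) fprintf(stderr, "failed: %s\n", cmd);
    }

    int
    main(void)
    {
        char name[512];

        for (int g = 0; g < 2; g++) {
            (void) snprintf(name, sizeof (name), POOL "/group-%d", g);
            create_fs(name);
            for (int c = 0; c < 100; c++) {
                (void) snprintf(name, sizeof (name),
                    POOL "/group-%d/container-%d", g, c);
                create_fs(name);
                for (int t = 0; t < 2; t++) {
                    (void) snprintf(name, sizeof (name),
                        POOL "/group-%d/container-%d/timeflow-%d", g, c, t);
                    create_fs(name);
                    for (int d = 0; d < 5; d++) {
                        (void) snprintf(name, sizeof (name),
                            POOL "/group-%d/container-%d/timeflow-%d/ds-%d",
                            g, c, t, d);
                        create_fs(name);
                        /* the 1MB file per leaf dataset is not shown here */
                    }
                }
            }
        }
        return (0);
    }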

Performance Testing: Methodology and Results

The system described above was rebooted 10 times, and the duration of
the start method of "filesystem/local" was measured. Specifically, the
"zfs mount -va" command that it calls was instrumented to break the
mounting process down into three buckets (a rough sketch of the
instrumentation follows the list):

  1. gathering the list of filesystems to mount (aka "load")
  2. mounting all filesystems (aka "mount")
  3. left-over time spent doing anything else (aka "other")
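
The instrumentation itself is not part of this message, but conceptually
it amounts to timestamping the phase boundaries, roughly as in the sketch
below (gather_filesystems() and mount_filesystems() are hypothetical
stand-ins for the command's real load and mount code paths; gethrtime(3C)
is the illumos high-resolution timer):

    #include <sys/time.h>
    #include <stdio.h>

    static void
    gather_filesystems(void)
    {
        /* stand-in for building the list of filesystems ("load") */
    }

    static void
    mount_filesystems(void)
    {
        /* stand-in for the (now parallel) mounting work ("mount") */
    }

    int
    main(void)
    {
        hrtime_t t0 = gethrtime();      /* command start */

        gather_filesystems();
        hrtime_t t1 = gethrtime();      /* end of "load" */

        mount_filesystems();
        hrtime_t t2 = gethrtime();      /* end of "mount" */

        hrtime_t t3 = gethrtime();      /* command end */

        (void) printf("load  %.3f s\n", (t1 - t0) / 1e9);
        (void) printf("mount %.3f s\n", (t2 - t1) / 1e9);
        /* "other" is everything outside the two instrumented phases */
        (void) printf("other %.3f s\n", ((t3 - t0) - (t2 - t0)) / 1e9);
        return (0);
    }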

The results of these measurements are below:

           | other (s) | load (s) | mount (s) |
       ----+-----------+----------+-----------+
    Before |    1.5    |    8.1   |    45.5   |
       ----+-----------+----------+-----------+
     After |    1.7    |    7.9   |    2.1    |
       ----+-----------+----------+-----------+

In summary, for this configuration, the filesystem/local SMF service
goes from taking an average of 55.1 seconds (+/- 1.0s) to an average of
11.7 seconds (+/- 0.8s). The "other" and "load" times remain unchanged
(unsurprising given that this project hasn't touched any code in those
areas).

The big win comes in the "mount" phase, which drops from 45.5 seconds
to 2.1 seconds, a roughly 95% decrease in latency
((45.5 - 2.1) / 45.5 = 0.954).

Using the same zpool as above, "zpool import" performance was also
tested; the mounting done by "zpool import" now uses the same framework
as "zfs mount -a". Performance improvement for this case is unsurprisingly
on par with the "zfs mount -a" improvement documented above.

Upstream bugs: DLPX-46555, DLPX-49847, DLPX-49351, 38457
You can view, comment on, or merge this pull request online at:

  https://github.com/openzfs/openzfs/pull/536

-- Commit Summary --

  * 8115 parallel zfs mount (v2)

-- File Changes --

    M usr/src/cmd/zfs/Makefile (3)
    M usr/src/cmd/zfs/zfs_main.c (122)
    M usr/src/lib/Makefile (2)
    A usr/src/lib/libfakekernel/common/synch.h (25)
    M usr/src/lib/libzfs/Makefile.com (7)
    M usr/src/lib/libzfs/common/libzfs.h (5)
    M usr/src/lib/libzfs/common/libzfs_dataset.c (30)
    M usr/src/lib/libzfs/common/libzfs_impl.h (9)
    M usr/src/lib/libzfs/common/libzfs_mount.c (408)
    M usr/src/lib/libzfs/common/mapfile-vers (4)
    D usr/src/lib/libzfs/common/sys/zfs_context.h (37)
    M usr/src/pkg/manifests/system-test-zfstest.mf (5)
    M usr/src/test/zfs-tests/runfiles/delphix.run (2)
    M usr/src/test/zfs-tests/tests/functional/cli_root/zfs_mount/zfs_mount.kshlib (8)
    A usr/src/test/zfs-tests/tests/functional/cli_root/zfs_mount/zfs_mount_all_fail.ksh (96)
    A usr/src/test/zfs-tests/tests/functional/cli_root/zfs_mount/zfs_mount_all_mountpoints.ksh (162)
    M usr/src/uts/common/fs/hsfs/hsfs_vfsops.c (3)
    M usr/src/uts/common/fs/pcfs/pc_vfsops.c (5)
    M usr/src/uts/common/fs/udfs/udf_vfsops.c (19)
    M usr/src/uts/common/fs/ufs/ufs_vfsops.c (8)
    M usr/src/uts/common/fs/vfs.c (8)
    M usr/src/uts/common/fs/zfs/sys/dsl_pool.h (2)
    M usr/src/uts/common/sys/vfs.h (3)

-- Patch Links --

https://github.com/openzfs/openzfs/pull/536.patch
https://github.com/openzfs/openzfs/pull/536.diff
