Below are partial results from a profile of a parallel (-j7) "buildworld" on
a 6-core machine that I did after the introduction of pmap_advise, so this
is not a new profile.  The results are sorted by total waiting time and
only the top 20 entries are listed.

   max wait_max       total wait_total     count avg wait_avg cnt_hold cnt_lock name
  1027   208500    16292932 1658585700   5297163   3      313        0  3313855 kern/vfs_cache.c:629 (rw:Name Cache)
208564   186514 19080891106 1129189627 355575930  53        3        0  1323051 kern/vfs_subr.c:2099 (lockmgr:ufs)
169241   148057   193721142  419075449  13819553  14       30        0   110089 kern/vfs_subr.c:2210 (lockmgr:ufs)
187092   191775  1923061952  257319238 328416784   5        0        0  5106537 kern/vfs_cache.c:488 (rw:Name Cache)
    23      114   134681925  220476269  40747213   3        5        0 25679721 kern/kern_clocksource.c:233 (spin mutex:et_hw_mtx)
 39069   101543  1931226072  208764524 482193429   4        0        0 22375691 kern/vfs_subr.c:2177 (sleep mutex:vnode interlock)
187131   187056  2509403648  140794697 298324050   8        0        0 14386756 kern/vfs_cache.c:669 (sleep mutex:vnode interlock)
  1421   257059   260943147  139520512 104936165   2        1        0 12997640 vm/vm_page.c:1225 (sleep mutex:vm page free queue)
 39612   145747   371125327  121005252 136149528   2        0        0  8280782 kern/vfs_subr.c:2134 (sleep mutex:vnode interlock)
  1720   249735   226621512   91906907  93436933   2        0        0  7092634 vm/vm_page.c:1770 (sleep mutex:vm active pagequeue)
394155   394200   330090749   86368442  48766123   6        1        0  1169061 kern/vfs_hash.c:78 (sleep mutex:vfs hash)
   892    93103     3446633   75923096   1482518   2       51        0   236865 kern/vfs_cache.c:799 (rw:Name Cache)
  4030   394151   395521192   63355061  47860319   8        1        0  6439221 kern/vfs_hash.c:86 (sleep mutex:vnode interlock)
  4554   147798   247338596   56263926 104192514   2        0        0  9455460 vm/vm_page.c:1948 (sleep mutex:vm page free queue)
  2587   230069   219652081   48271335  94011085   2        0        0  9011261 vm/vm_page.c:1729 (sleep mutex:vm active pagequeue)
 16420    50195   920083075   38568487 347596869   2        0        0  3035672 kern/vfs_subr.c:2107 (sleep mutex:vnode interlock)
 57348    93913    65957615   31956482   2487620  26       12        0    39048 vm/vm_fault.c:672 (rw:vm object)
  1798    93694   127847964   28490515  46510308   2        0        0  1897724 kern/vfs_subr.c:419 (sleep mutex:struct mount mtx)
249739   207227   775356648   25501046  95007901   8        0        0   211559 vm/vm_fault.c:918 (sleep mutex:vm page)
452130   157222    70439287   18564724   5429942  12        3        0    10813 vm/vm_map.c:2738 (rw:vm object)
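
Since several call sites share the same lock, it can help to sum wait_total per lock name rather than per line. The following is only a rough sketch of how that could be done, assuming each profile record has first been joined onto a single line; the names SAMPLE, parse and wait_by_lock are made up for illustration and only four of the rows above are included:

```python
import re
from collections import defaultdict

# Four rows copied from the profile above, one record per line.
SAMPLE = """\
1027 208500 16292932 1658585700 5297163 3 313 0 3313855 kern/vfs_cache.c:629 (rw:Name Cache)
208564 186514 19080891106 1129189627 355575930 53 3 0 1323051 kern/vfs_subr.c:2099 (lockmgr:ufs)
169241 148057 193721142 419075449 13819553 14 30 0 110089 kern/vfs_subr.c:2210 (lockmgr:ufs)
187092 191775 1923061952 257319238 328416784 5 0 0 5106537 kern/vfs_cache.c:488 (rw:Name Cache)
"""

# Column names in the order they appear in the profile header.
FIELDS = ("max", "wait_max", "total", "wait_total", "count",
          "avg", "wait_avg", "cnt_hold", "cnt_lock")

# Nine integer columns, then "file:line", then "(lock class:lock name)".
ROW = re.compile(r"^\s*((?:\d+\s+){9})(\S+)\s+\((.+)\)\s*$")


def parse(text):
    """Turn profile text into a list of dicts, skipping non-matching lines."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        row = dict(zip(FIELDS, (int(v) for v in m.group(1).split())))
        row["site"] = m.group(2)
        row["lock"] = m.group(3)
        rows.append(row)
    return rows


def wait_by_lock(rows):
    """Sum wait_total per lock and return (lock, wait) pairs, largest first."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["lock"]] += row["wait_total"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for lock, wait in wait_by_lock(parse(SAMPLE)):
        print(f"{wait:>12}  {lock}")
```

Feeding it all 20 rows instead of the four in SAMPLE gives the per-lock totals for the whole listing.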


On Thu, Mar 12, 2015 at 12:36 PM, Mateusz Guzik <mjgu...@gmail.com> wrote:

> On Thu, Mar 12, 2015 at 11:14:42AM -0400, Ryan Stone wrote:
> > I've just submitted a patch to Differential[1] for review that converts the
> > VFS cache to use an rmlock in place of the current rwlock.  My main
> > motivation for the change is to fix a priority inversion problem that I saw
> > recently.  A real-time priority thread attempted to acquire a write lock on
> > the VFS cache lock, but there was already a reader holding it.  The reader
> > was preempted by a normal priority thread, and my real-time thread was
> > starved.
> >
> > [1] https://reviews.freebsd.org/D2051
> >
> >
> > I was worried about the performance implications of the change, as I wasn't
> > sure how common write operations on the VFS cache would be.  I did a -j12
> > buildworld/buildkernel test on a 12-core Haswell Xeon system, as I figured
> > that would be a reasonable stress test that simultaneously creates lots of
> > small files and reads a lot of files as well.  This actually wound up being
> > about a 10% performance *increase* (the units below are seconds of elapsed
> > time as measured by /usr/bin/time, so smaller is better):
> >
> > $ ministat -C 1 orig.log rmlock.log
> > x orig.log
> > + rmlock.log
> >
> > [ministat ASCII distribution plot not recoverable from the archive]
> >     N           Min           Max        Median           Avg        Stddev
> > x   6       2710.31       2821.35       2816.75     2798.0617     43.324817
> > +   5       2488.25       2500.25       2498.04      2495.756     5.0494782
> > Difference at 95.0% confidence
> >         -302.306 +/- 44.4709
> >         -10.8041% +/- 1.58935%
> >         (Student's t, pooled s = 32.4674)
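
The "Difference at 95.0% confidence" figures can be rechecked from the summary statistics alone. This is a small sketch rather than ministat's actual code: it assumes the usual pooled-variance Student's t interval, and it hardcodes 2.262 as the two-sided 95% critical value for 9 degrees of freedom (taken from a standard t table). The function name pooled_difference is made up for illustration.

```python
import math

# Summary statistics copied from the ministat output above.
N_ORIG, AVG_ORIG, SD_ORIG = 6, 2798.0617, 43.324817   # orig.log (rwlock)
N_RM, AVG_RM, SD_RM = 5, 2495.756, 5.0494782          # rmlock.log

# Assumption: two-sided 95% critical value of Student's t for
# 6 + 5 - 2 = 9 degrees of freedom, from a standard t table.
T_CRIT_DF9 = 2.262


def pooled_difference(n1, m1, s1, n2, m2, s2, t_crit):
    """Return (difference of means, CI half-width, pooled stddev)."""
    df = n1 + n2 - 2
    pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    diff = m2 - m1
    half_width = t_crit * pooled * math.sqrt(1 / n1 + 1 / n2)
    return diff, half_width, pooled


if __name__ == "__main__":
    diff, half, pooled = pooled_difference(
        N_ORIG, AVG_ORIG, SD_ORIG, N_RM, AVG_RM, SD_RM, T_CRIT_DF9)
    print(f"difference: {diff:.3f} +/- {half:.4f} seconds")
    print(f"relative:   {100 * diff / AVG_ORIG:.4f}% "
          f"+/- {100 * half / AVG_ORIG:.5f}%")
    print(f"pooled s:   {pooled:.4f}")
```

The relative figures are taken against the orig.log average, which is how the -10.8% above comes out.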
> >
> > The one outlier in the rwlock case does confuse me a bit.  What I did was
> > boot a freshly-built image with the rmlock patch applied, do a git
> > checkout of head, and then do 5 builds in a row.  The git checkout should
> > have had the effect of priming the disk cache with the source files.  Then
> > I installed the stock head kernel, rebooted, and ran 5 more builds (and
> > then 1 more when I noticed the outlier).  The fast outlier was the *first*
> > run, which should have been running with a cold disk cache, so I really
> > don't know why it would be 90 seconds faster.  I do see that this run also
> > had about 500-600 fewer seconds spent in system time:
> >
> > x orig.log
> >
> > [ministat ASCII distribution plot not recoverable from the archive]
> >     N           Min           Max        Median           Avg        Stddev
> > x   6       3515.23       4121.84       4105.57       4001.71     239.61362
> >
> > I'm not sure how much I care, given that the rmlock is universally
> > faster (but maybe I should try the "cold boot" case anyway).
> >
> > If anybody has any comments or further testing that they would like to see,
> > please let me know.
>
> Workloads like buildworld and the like (i.e. a lot of forks + execs) run
> into very severe contention in vm, which is orders of magnitude bigger
> than anything else.
>
> As such your result seems quite suspicious.
>
> Can you describe in more detail how you were testing?
>
> Did you have a separate fs for the obj tree which was mounted+unmounted
> before each run?
>
> I suggest you grab a machine from zoo[1] and run some tests on "bigger"
> hardware.
>
> A perf improvement, even slight, is definitely welcome.
>
> [1] https://wiki.freebsd.org/TestClusterOneReservations
>
> --
> Mateusz Guzik <mjguzik gmail.com>
> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
>