Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Provide minor private interface modifications to support mntfs
    1.2. Name of Document Author/Supplier:
         Author:  Robert Harris
    1.3  Date of This Document:
        16 October, 2009
4. Technical Description


1. Proposal:

    Provide minor private interface modifications to support mntfs.


2. The Problem:

    The contents of /etc/mnttab are created by mntfs on demand.
    mntfs parses the in-kernel mnttab structures to create a
    snapshot that can be used to satisfy subsequent calls to
    read() or ioctl(). The snapshot is stored by the kernel
    within the address space of the process that made the first
    call to read() or ioctl(). The enclosing mapping is removed
    from the calling process's address space by mntfs on last
    close().

    The snapshot-in-userland design has a flaw: the kernel cannot
    determine whether or not a close() is a specific process's
    last if the vnode count is greater than 1. This is because
    there is no way to determine whether a count that is greater
    than one has originated from dup(), from fork() or from
    both.

    This means that mntfs is unable to ensure that every
    insertion of a mapping into a process's address space is
    paired with a corresponding deletion. Two specific
    manifestations are 6394241, in which a newly-execed process
    has an arbitrary range of its address space unmapped by
    mntfs, and 6813502, in which a process address space is
    entirely consumed by orphaned mappings left behind by mntfs.


3. Solutions:

    The most obvious solution seemed, at first, to involve
    storing the snapshot data within the corresponding vnode,
    thereby allowing the existing file system infrastructure to
    free the resources when no longer required. This, however,
    was rejected because it would give an unprivileged user the
    ability to allocate and retain kernel memory.

    It was previously believed that there remained no alternative
    but to abandon the use of snapshots in their current form.
    That approach would have necessitated a change to the behaviour
    of /etc/mnttab and its API, and it resulted in an earlier PSARC
    case, 2009/352.

    Although case 2009/352 was approved, comments exchanged during
    its review have led to the design of a solution that retains
    all of the existing documented behaviour and yet has minimal
    consumption of kernel memory. This solution has been adopted
    as the preferred approach.

    Very briefly, the new proposal effects a snapshot by
    constructing a per-zone "database" that encapsulates the
    different states of the in-kernel mnttab that are visible to
    existing consumers. The database takes the form of a linked list
    where every element represents an entry in /etc/mnttab and
    has a time of birth and a time of death.

    Because each element carries appropriate time stamps, a
    consumer need remember only the time at which its own
    view was created. This view, i.e. snapshot, can be generated
    on demand by walking through the database and extracting all
    elements that were "born" before, but that "died" after, the
    snapshot creation time. Elements are removed when they are
    no longer referenced by any existing consumer, and so the
    database need not exist at all when there are no consumers.


4. Impact:

4.1 Overview:

    This solution has some modest requirements. The database is
    maintained on a per-zone basis, and so the zone_t will acquire
    two new fields: a pointer to the database and a lock. Two
    new private ioctl() commands will be added, MNTIOC_GETEXTMNTENT
    and MNTIOC_GETMNTANY, to ensure that the getmntent(3C) family
    of functions can be serviced as efficiently as possible.

    More delicate is the need for every vfs_t present in the
    in-kernel mnttab to have a high-resolution time stamp
    indicating its time of creation (not its mount time). The
    vfs_t is unusual in that it is exposed to unbundled file
    systems, and is therefore considered dangerous to modify. For
    this reason, following PSARC 2006/270, there now exists a
    vfs_impl_t, referenced by a vfs_t's vfs_implp, that is designed
    to accommodate additional fields that would otherwise occupy
    the vfs_t. As part of this change, the vfs_impl_t will acquire
    a new field: a high-resolution time stamp.

    The new time stamp in the vfs_t's vfs_impl_t will be
    initialised in vfs_list_add(), a private function that inserts
    a mounted vfs_t into the in-kernel mnttab. Because the
    time stamp will be mandatory, vfs_list_add() will be modified
    to supply a vfs_t with a vfs_impl_t if it does not already
    have one. This is significant because unexpected behaviour
    could occur if an unbundled file system allocates its own
    vfs_t for insertion into the in-kernel mnttab. Consequently,
    developers of the unbundled file systems PxFS, VxFS, QFS and
    MVFS have been approached and have provided confirmation that
    the proposed changes will be harmless. Developers of OpenAFS
    have been approached but have not responded. File systems in ON
    are unaffected.
       
4.2 Interface changes:

    1. The vfs_impl struct acquires a new member, vi_hrctime,
       which is the high-resolution creation time of the
       corresponding vfs_t.

    2. vfs_list_add() will supply a vfs_t with a vfs_impl_t if it
       does not already have one.

    3. The zone struct acquires two new members, zone_mntfs_db
       and zone_mntfs_db_lock. These implement the per-zone database
       described in section 3.

    4. The existing ioctl() command MNTIOC_GETMNTENT will be
       modified and two new ioctl() commands, MNTIOC_GETEXTMNTENT
       and MNTIOC_GETMNTANY, will be created. These changes will
       support the getmntent(3C) family of functions. Note that
       they will create a backwards incompatibility that is likely
       to affect S10-branded zones.

    These interfaces will all be Consolidation Private.

4.3 Other:

    Once approved, this case will supersede PSARC/2009/352.


5. Release binding:

    Patch.


6. Documentation impact:

    None.


7. References:

1. CR 6394241 mntfs is not exec safe

2. CR 6813502 mntfs is not fork-safe

3. PSARC 2009/352.

4. PSARC 2006/270.


6. Resources and Schedule
   6.4. Steering Committee requested information
      6.4.1. Consolidation C-team Name:
        ON
   6.5. ARC review type: FastTrack
   6.6. ARC Exposure: open



