Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Provide minor private interface modifications to support mntfs
    1.2. Name of Document Author/Supplier:
         Author:  Robert Harris
    1.3  Date of This Document:
         16 October, 2009
4. Technical Description
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Provide minor private interface modifications to support mntfs.
    1.2. Name of Document Author/Supplier:
         Author:  Robert Harris
    1.3  Date of This Document:
         13 October, 2009
4. Technical Description

1. Proposal:

   Provide minor private interface modifications to support mntfs.

2. The Problem:

   The contents of /etc/mnttab are created by mntfs on demand. mntfs
   parses the in-kernel mnttab structures to create a snapshot that can
   be used to satisfy subsequent calls to read() or ioctl(). The
   snapshot is stored by the kernel within the address space of the
   process that made the first call to read() or ioctl(). The enclosing
   mapping is removed from the calling process's address space by mntfs
   on last close().

   The snapshot-in-userland design has a flaw: the kernel cannot
   determine whether or not a close() is a specific process's last if
   the vnode count is greater than 1. This is because there is no way
   to determine whether a count that is greater than one has originated
   from dup(), from fork() or from both. This means that mntfs is
   unable to ensure that every insertion of a mapping into a process's
   address space is paired with a corresponding deletion.

   Two specific manifestations are 6394241, in which a newly-execed
   process has an arbitrary range of its address space unmapped by
   mntfs, and 6813502, in which a process address space is entirely
   consumed by orphaned mappings left behind by mntfs.

3. Solutions:

   The most obvious solution seemed, at first, to involve storing the
   snapshot data within the corresponding vnode, thereby allowing the
   existing file system infrastructure to free the resources when no
   longer required. This, however, was rejected on account of
   complications inherent in the unprivileged user's resulting ability
   to allocate and retain kernel memory.

   It was previously believed that there remained no alternative other
   than to abandon the use of snapshots in their current form. That
   approach would have necessitated a change to the behaviour of
   /etc/mnttab and its API, and resulted in an earlier PSARC case,
   2009/352.

   Although case 2009/352 was approved, comments exchanged during its
   review have led to the design of a solution that retains all of the
   existing documented behaviour and yet has minimal consumption of
   kernel memory. This solution has been adopted as the preferred
   approach.

   Very briefly, the new proposal effects a snapshot by constructing a
   per-zone "database" that encapsulates the different states of the
   in-kernel mnttab that are visible to existing consumers. The
   database takes the form of a linked list in which every element
   represents an entry in /etc/mnttab and has a time of birth and a
   time of death. Because each element carries these time stamps, a
   consumer need remember only the time at which its own view was
   created. This view, i.e. snapshot, can be generated on demand by
   walking through the database and extracting all elements that were
   "born" before, but that "died" after, the snapshot creation time.
   Elements are removed when they are no longer referenced by any
   existing consumer, and so, in the absence of consumers, the database
   need not exist at all.

4. Impact:

   4.1 Overview:

   This solution has some modest requirements. The database is
   maintained on a per-zone basis, and so the zone_t will acquire two
   new fields: a pointer to the database and a lock. Two new private
   ioctl() commands will be added, MNTIOC_GETEXTMNTENT and
   MNTIOC_GETMNTANY, to ensure that the getmntent(3C) family of
   functions can be serviced as efficiently as possible.

   More delicate is the need for every vfs_t present in the in-kernel
   mnttab to have a high-resolution time stamp indicating its time of
   creation (not its mount time). The vfs_t is unusual in that it is
   exposed to unbundled file systems, and is therefore considered
   dangerous to modify.
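
   As an illustration only, the birth/death time-stamp scheme described
   in section 3 might be sketched as follows. All names here
   (snap_time_t, db_elem_t, ELEM_ALIVE, elem_visible(), snapshot_walk())
   are hypothetical and are not taken from the mntfs sources. The
   inclusive comparison on the time of birth is likewise an assumption,
   and locking and the removal of unreferenced elements are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the mntfs implementation. */
typedef int64_t snap_time_t;          /* stand-in for a high-resolution time */
#define ELEM_ALIVE INT64_MAX          /* "death" time of an entry still present */

typedef struct db_elem {
    const char     *mountp;           /* stands in for a whole mnttab entry */
    snap_time_t     birth;            /* time at which the entry appeared */
    snap_time_t     death;            /* time at which it vanished, or ELEM_ALIVE */
    struct db_elem *next;
} db_elem_t;

/*
 * An element belongs to a snapshot if it was born before the snapshot
 * was created and died after it. Treating birth == snap as "before"
 * is an assumption of this sketch.
 */
static int
elem_visible(const db_elem_t *e, snap_time_t snap)
{
    return (e->birth <= snap && e->death > snap);
}

/*
 * Walk the list and store up to max visible mount points in out.
 * Returns the number of entries stored.
 */
static size_t
snapshot_walk(const db_elem_t *head, snap_time_t snap,
    const char **out, size_t max)
{
    size_t n = 0;

    assert(out != NULL || max == 0);
    for (const db_elem_t *e = head; e != NULL && n < max; e = e->next) {
        if (elem_visible(e, snap))
            out[n++] = e->mountp;
    }
    return (n);
}
```

   In this sketch a consumer keeps only its own snapshot time and calls
   snapshot_walk() whenever it needs its view; two consumers with
   different snapshot times can therefore share one list yet see
   different sets of entries.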

   To avoid modifying the vfs_t itself, there now exists, following
   PSARC 2006/270, a vfs_impl_t, referenced by a vfs_t's vfs_implp,
   that is designed to accommodate additional fields that would
   otherwise occupy the vfs_t. As part of this change, the vfs_impl_t
   will acquire a new field: a high-resolution time stamp.

   The new time stamp in the vfs_t's vfs_impl_t will be initialised in
   vfs_list_add(), a private function that inserts a mounted vfs_t into
   the in-kernel mnttab. Because the time stamp will be mandatory,
   vfs_list_add() will be modified to supply a vfs_t with a vfs_impl_t
   if it does not already have one. This is significant because
   unexpected behaviour could occur if an unbundled file system
   allocates its own vfs_t for insertion into the in-kernel mnttab.
   Consequently, developers of the unbundled file systems PxFS, VxFS,
   QFS and MVFS have been approached and have provided confirmation
   that the proposed changes will be harmless. Developers of OpenAFS
   have been approached but have not responded. File systems in ON are
   unaffected.

   4.2 Interface changes:

   1. The vfs_impl struct acquires a new member, vi_hrctime, which is
      the high-resolution creation time of the corresponding vfs_t.

   2. vfs_list_add() will supply a vfs_t with a vfs_impl_t if it does
      not already have one.

   3. The zone struct acquires two new members, zone_mntfs_db and
      zone_mntfs_db_lock. These implement the per-zone database
      described in section 3.

   4. The existing ioctl() command MNTIOC_GETMNTENT will be modified
      and two new ioctl() commands, MNTIOC_GETEXTMNTENT and
      MNTIOC_GETMNTANY, will be created. These changes will support
      the getmntent(3C) family of functions. Note that they will
      create a backwards incompatibility that is likely to affect
      S10-branded zones.

   These interfaces will all be Consolidation Private.

   4.3 Other

   Once approved, this case will supersede PSARC/2009/352.

5. Release binding:

   Patch.

6. Documentation impact:

   None.

7. References:

   1. CR 6394241 mntfs is not exec safe
   2. CR 6813502 mntfs is not fork-safe program.
   3. PSARC 2009/352.
   4. PSARC 2006/270.

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open