I am sponsoring the following fasttrack for myself, requesting patch
binding and a timeout of 06/06/2007.
Template Version: @(#)sac_nextcase 1.61 05/24/07 SMI
This information is Copyright 2007 Sun Microsystems
1. Introduction
1.1. Project/Component Working Name:
Driver open-close exclusion guarantees
1.2. Name of Document Author/Supplier:
Author: Chris Horne, Chris Gerhard
1.3 Date of This Document:
29 May, 2007
4. Technical Description:
4.1. Summary:
This proposal clarifies the interaction between the kernel and
a non-streams driver's open(9E) and close(9E) implementations.
The proposal focuses on two aspects of this interaction: the
execution exclusion guarantees between open(9E) and close(9E)
calls, and the last-reference accounting associated with
close(9E) calls.
The proposal includes open(9E), close(9E), and cb_ops(9S) man
page changes, as well as a new man page for
ddi-open-returns-eintr(9P).
The proposal requests patch binding.
4.1.1. Problems:
UNIX has always used an open-close model where each device open
results in an open(9E) call, and the last-reference close
results in a single close(9E) call. While this basic model is
simple and well understood, what this means for exclusion
guarantees between open(9E) and close(9E) in a multi-threaded
preemptive kernel environment like Solaris is not documented.
UNIX last-reference accounting associated with a close(9E) call
historically counted successfully completed open(9E) calls as
'open'. This works well in a single-threaded non-preemptive
kernel environment, but it does not work well for Solaris.
Solaris last-reference accounting has always treated
in-progress open(9E) calls as 'open', but this is not clearly
documented.
Without a clear definition of both the exclusion guarantees and
last-reference accounting, it is difficult to write a reliable
driver.
4.1.2. Proposal:
This proposal defines, for non-streams drivers, execution
exclusion guarantees between open(9E) and close(9E) calls, and
last-reference accounting associated with close(9E) calls.
4.1.2.1 Exclusion:
To provide open-close exclusion in a multi-threaded preemptive
kernel environment like Solaris, an executing close(9E) call
must act as a barrier to all subsequent open(9E) calls: the
last-reference close(9E) call needs to return before the next
open(9E) call is allowed to start.
Today the kernel implements open-close exclusion for streams
drivers, but not for non-streams drivers. Non-streams drivers
either incorrectly assume exclusion or are complicated by
needing to implement their own exclusion.
When exclusion for non-streams drivers is implemented, in
situations where an active close(9E) call is preventing a new
open(9E) call due to exclusion, having the framework always
treat the waiting-open as interruptible is unsafe -
applications may not be coded to expect a new EINTR return from
open. This proposal provides new interfaces that allow the
framework to determine if a waiting-open is safely
interruptible.
Exclusion is provided at (dev_t, otyp) granularity, where dev_t
and otyp refer to open(9E) arguments. The otyp values of
interest are OTYP_BLK and OTYP_CHR. If this granularity is too
fine-grained, the driver writer is left having to implement his
own exclusion and accounting (often at ddi_get_instance(9F)
granularity). Providing exclusion guarantees at instance
granularity is outside the scope of this proposal.
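The decision the framework makes for a new open on one (dev_t, otyp)
pair can be modeled in user space as a small pure function. This is
only an illustrative sketch, not kernel code: the open_gate structure,
the gate_decide() helper, and the GATE_* names are hypothetical, and
the eintr_safe boolean merely stands in for the driver having set
D_OPEN_RETURNS_EINTR or ddi-open-returns-eintr(9P).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-(dev_t, otyp) state kept by the framework model. */
struct open_gate {
	bool closing;     /* a last-reference close(9E) is executing */
	bool eintr_safe;  /* driver declared EINTR from open to be safe */
};

enum gate_result {
	GATE_PROCEED,     /* call the driver's open(9E) now */
	GATE_WAIT,        /* keep waiting for close(9E) to return */
	GATE_EINTR        /* fail the open with EINTR; open(9E) not called */
};

/*
 * Decide what happens to a new open on this (dev_t, otyp) pair.
 * signal_pending says whether the waiting thread has a signal pending.
 */
enum gate_result
gate_decide(const struct open_gate *g, bool signal_pending)
{
	assert(g != NULL);
	if (!g->closing)
		return (GATE_PROCEED);
	if (signal_pending && g->eintr_safe)
		return (GATE_EINTR);
	/* Without the driver's declaration the wait is not interruptible. */
	return (GATE_WAIT);
}
```

In this model a signal only turns into an EINTR return when the driver
has opted in; otherwise the open keeps waiting until the close(9E)
call returns.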
4.1.2.2 Last-reference accounting:
Last-reference accounting occurs at the same (dev_t, otyp)
granularity as exclusion. Solaris last-reference accounting has
always treated in-progress open(9E) calls as 'open', but this
is not clearly documented.
No change to last-reference accounting is proposed; however, an
explanation of how accounting is implemented is necessary,
especially for implementing 'special behaviors' where the
driver open(9E) and close(9E) implementations interact.
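The accounting rule can be illustrated with a small user-space model
of one (dev_t, otyp) pair. This is a sketch under stated assumptions
and not the specfs implementation: the ref_state structure and the
ref_*() helpers are hypothetical, and the close_owed flag is one
plausible way to express that a failed open(9E) may be the event that
drops the last reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical accounting state for one (dev_t, otyp) pair. */
struct ref_state {
	int  in_progress;  /* open(9E) calls entered but not yet returned */
	int  opened;       /* successful opens not yet released */
	bool close_owed;   /* a successful open occurred since last close(9E) */
	int  close_calls;  /* times the modeled close(9E) has been called */
};

/* Issue the modeled close(9E) once no reference of either kind remains. */
static void
ref_maybe_close(struct ref_state *r)
{
	if (r->in_progress + r->opened == 0 && r->close_owed) {
		r->close_owed = false;
		r->close_calls++;
	}
}

/* A thread enters open(9E); the device counts as referenced from here. */
void
ref_open_enter(struct ref_state *r)
{
	r->in_progress++;
}

/* open(9E) returns; err is 0 on success or an errno value on failure. */
void
ref_open_exit(struct ref_state *r, int err)
{
	assert(r->in_progress > 0);
	r->in_progress--;
	if (err == 0) {
		r->opened++;
		r->close_owed = true;
	} else {
		/* A failed open may be the one dropping the last reference. */
		ref_maybe_close(r);
	}
}

/* A successful open is released, for example by close(2) or exit(2). */
void
ref_release(struct ref_state *r)
{
	assert(r->opened > 0);
	r->opened--;
	ref_maybe_close(r);
}
```

In this model, if one process releases its open while another open(9E)
call is still in progress, the close(9E) call is deferred; it is
issued only when the in-progress open fails or when its eventual
successful open is released.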
4.1.2.3 Special Behaviors:
Understanding exclusion guarantees and last-reference
accounting typically simplifies driver writing. However, for some
behaviors additional guidance is still needed. Implementing
these behaviors involves 'self-clone', where a driver changes
the *devp value passed to open(9E). A driver that does a
self-clone does not necessarily need to call
ddi_create_minor_node(9F) for the new *devp value.
o A driver that supports O_NDELAY (FNDELAY) and blocks in
open(9E) or close(9E) for an event that takes a long time (or
may never occur) must use separate minor nodes for O_NDELAY
and non-O_NDELAY access for the applications to get real
O_NDELAY behavior. Applications using the device must either
match the minor node used with their O_NDELAY flag use, or
the driver must self-clone to match O_NDELAY flag use.
This guidance is related to both exclusion and last-reference
accounting. For exclusion, this guidance prevents a new
O_NDELAY open from waiting on completion of a non-O_NDELAY
close(9E). For last-reference accounting, this guidance
allows an O_NDELAY close(9E) to occur while there is a
blocked non-O_NDELAY open(9E) call.
This is already a de facto Solaris requirement: an example is
the OUTLINE implementation used by serial communications
drivers like zs(7D).
In this situation, Solaris-specific DDI considerations
influence how a driver must implement a POSIX-compliant
O_NDELAY open(9E). An unmodified SVR4 driver's O_NDELAY
open(9E) implementation may not be POSIX-compliant under
Solaris.
NOTE: Some drivers (such as sd(7D)) use O_NDELAY to support
administrative commands which need to open the device prior
to full device initialization. These drivers fail their
non-O_NDELAY open(9E) instead of blocking, so they do not
need to use separate minor nodes.
o A driver that blocks in open(9E) for an event signaled from
close(9E) must self-clone.
This guidance is related to last-reference accounting. If not
followed, the close(9E) call will never occur since an
in-progress open(9E) call counts as an 'open'.
This is already a de facto Solaris requirement: an example is
a queuing exclusive use device, like a printer. Originally,
UNIX printer drivers slept in open(9E) if the device was
already in use. This provided a driver-based queuing
system.
o A driver that blocks in close(9E) for an event that takes a
long time (or may never occur) is preventing subsequent
open(9E) operations. While blocking in close(9E) is not
prohibited, the driver writer needs to understand the
ramifications, possibly setting the D_OPEN_RETURNS_EINTR
cb_ops(9S) flag or setting ddi-open-returns-eintr(9P) in
driver.conf(4) if it is safe to return EINTR from open.
This guidance is related to exclusion guarantees.
This is already a de facto Solaris requirement for streams:
an example is maximum drain times on close for streams. The
ramifications of blocking indefinitely in close are not new
for streams since streams currently has exclusion.
Applications opening streams already expect EINTR, so the
waiting-open can be interruptible.
In the situations above, implementing multiple minor nodes or
doing a 'self-clone' expands the operation beyond the typical
(dev_t, otyp) granularity, so exclusion and last-reference
accounting are no longer an impediment to implementing atypical
behaviors.
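The minor-number choice behind an O_NDELAY self-clone can be sketched
as follows. This is a user-space illustration only: model_minor_t,
MODEL_FNDELAY, MODEL_NDELAY_BIT, and model_clone_minor() are
hypothetical names with made-up values, standing in for minor_t, the
FNDELAY open flag, and a driver-private minor bit. In a real driver
the chosen minor would be stored into *devp with makedevice(9F) and
getmajor(9F).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t model_minor_t;      /* stand-in for minor_t */

#define MODEL_FNDELAY     0x04u      /* hypothetical stand-in for FNDELAY */
#define MODEL_NDELAY_BIT  0x10000u   /* hypothetical minor bit for O_NDELAY */

/*
 * Pick the minor number a self-cloning open(9E) would place in *devp,
 * so that O_NDELAY and non-O_NDELAY opens of the same unit land on
 * different (dev_t, otyp) pairs.
 */
model_minor_t
model_clone_minor(model_minor_t unit, int flag)
{
	/* The unit number itself must not already use the marker bit. */
	assert((unit & MODEL_NDELAY_BIT) == 0);
	if ((unsigned int)flag & MODEL_FNDELAY)
		return (unit | MODEL_NDELAY_BIT);
	return (unit);
}
```

Because the two kinds of open end up with different dev_t values,
exclusion and last-reference accounting are applied to each
independently.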
4.1.2.4 Legacy non-DDI compliant interface issues:
The Solaris open(9E) close(9E) exclusion guarantee is annulled
when kernel software, other than specfs, uses the following
private non-DDI interfaces: dev_open(), dev_close(), cb_ops(9S)
cb_open, or cb_ops(9S) cb_close. If these private non-DDI
interfaces are used, no new problems occur, but consumers
should switch to use the Layered Driver Interfaces (LDI, PSARC
2001/769). LDI provides a DDI compliant way to perform these
operations which does not annul exclusion guarantees.
4.2. Bug/RFE Number(s):
6343604 specfs race: multiple "last-close" of the same device
4127807 DDI: Is there a race between open(9e) and close(9e)?
4.5. Interfaces:
------------------------------------------------------------------------
Interface                     Level       Comments
------------------------------------------------------------------------
Existing:
  open(9E)                    Committed   Define exclusion and
  close(9E)                   "           last-reference behavior.
New:
  D_OPEN_RETURNS_EINTR        "           cb_ops(9S) cb_flag:
                                          Driver returns and
                                          applications expect
                                          EINTR from device open.
  ddi-open-returns-eintr(9P)  "           driver.conf(4) property:
                                          Driver returns and
                                          applications expect
                                          EINTR from device open.
6. Resources and Schedule:
6.4. Product Approval Committee requested information:
6.4.1. Consolidation or Component Name:
ON
6.5. ARC review type:
FastTrack
A. Man page changes
A.1 open(9E) man page changes:
Driver Entry Points open(9E)
NAME
open - gain access to a device
SYNOPSIS
Block and Character
#include <sys/types.h>
#include <sys/file.h>
#include <sys/errno.h>
#include <sys/open.h>
#include <sys/cred.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>
int prefixopen(dev_t *devp, int flag, int otyp, cred_t
*cred_p);
STREAMS
#include <sys/file.h>
#include <sys/stream.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>
int prefixopen(queue_t *q, dev_t *devp, int oflag, int
sflag, cred_t *cred_p);
---%<---
DESCRIPTION
The driver's open() routine is called by the kernel during
an open(2) or a mount(2) on the special file for the
> device. A device may be opened simultaneously by multiple
> processes and the open() driver routine is called for each
> open. Note that a device is referenced once its associated
> open(9E) routine is entered, and thus open(9E)'s which have
> not yet completed will prevent close(9E) from being called.
>
| The routine should verify that the minor number
component of *devp is valid, that the type of access requested
by otyp and flag is appropriate for the device, and, if
required, check permissions using the user credentials
pointed to by cred_p.
> The kernel provides open() close() exclusion guarantees to the
> driver at (*devp, otyp) granularity. This delays new open()
> calls to the driver while a last-reference close() call is
> executing. If the driver has indicated that an EINTR return
> is safe via the D_OPEN_RETURNS_EINTR cb_ops(9S) cb_flag or
> ddi-open-returns-eintr(9P) then a delayed open() may be
> interrupted by a signal, resulting in an EINTR return.
>
> Last-reference accounting and open() close() exclusion
> typically simplify driver writing; however, in some cases they
> may be an impediment for certain types of drivers. To overcome
> any impediment the driver can change minor numbers in open(9E),
> as described below, or implement multiple minor nodes for the
> same device - both techniques give the driver control over
> when close() calls will occur and whether additional open()
> calls will be delayed while close() is executing.
The open() routine is passed a pointer to a device number so
that the driver can change the minor number. This allows
drivers to dynamically create minor instances of the device.
An example of this might be a pseudo-terminal driver
that creates a new pseudo-terminal whenever it is opened.
A driver that chooses the minor number dynamically, normally
creates only one minor device node in attach(9E) with
ddi_create_minor_node(9F) then changes the minor number
component of *devp using makedevice(9F) and getmajor(9F).
The driver needs to keep track of available minor numbers
> internally. A driver that dynamically creates minor
> numbers may want to avoid returning the original minor
> number since returning the original minor will result in
> postponed dynamic opens when the close() call for the
> original minor occurs.
---%<---
SEE ALSO
> ddi-open-returns-eintr(9P), cb_ops(9S)
---%<---
A.2 close(9E) man page changes:
Driver Entry Points close(9E)
NAME
close - relinquish access to a device
SYNOPSIS
Block and Character
#include <sys/types.h>
#include <sys/file.h>
#include <sys/errno.h>
#include <sys/open.h>
#include <sys/cred.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>
int prefixclose(dev_t dev, int flag, int otyp, cred_t
*cred_p);
---%<---
DESCRIPTION
For STREAMS drivers, the close() routine is called by the
kernel through the cb_ops(9S) table entry for the device.
(Modules use the fmodsw table.) A non-null value in the
d_str field of the cb_ops entry points to a streamtab
structure, which points to a qinit(9S) containing a pointer
to the close() routine. Non-STREAMS close() routines are
called directly from the cb_ops table.
close() ends the connection between the user process and the
device, and prepares the device (hardware and software) so
that it is ready to be opened again.
< A device may be opened simultaneously by multiple processes
< and the open() driver routine is called for each open, but
< the kernel will only call the close() routine when the last
< process using the device issues a close(2) or umount(2)
< system call or exits. (An exception is a close occurring
< with the otyp argument set to OTYP_LYR, for which a close
< (also having otyp = OTYP_LYR) occurs for each open.)
> A device may be opened simultaneously by multiple processes
> and the open() driver routine is called for each open.
> For all otyp values other than OTYP_LYR the kernel calls
> the close() routine when the last-reference occurs. For
> OTYP_LYR each close operation will call the driver.
>
> Kernel accounting for last-reference occurs at (dev, otyp)
> granularity. Note that a device is referenced once its
> associated open(9E) routine is entered, and thus open(9E)'s
> which have not yet completed will prevent close(9E) from
> being called. The driver close(9E) call associated with the
> last-reference going away is typically issued as a result
> of a close(2), exit(2), munmap(2), or umount(2). However, a
> failed open(9E) call can cause this last-reference close(9E)
> call to be issued as a result of an open(2) or mount(2).
>
> The kernel provides open() close() exclusion guarantees
> to the driver at the same (dev, otyp) granularity as
> last-reference accounting. The kernel delays new calls to the
> open() driver routine while the last-reference close() call is
> executing - a driver that blocks in close() will not see new
> calls to open() until it returns from close(). This
> effectively delays invocation of other cb_ops(9S) driver entry
> points that depend on an open(9E) established device reference
> too. If the driver has indicated that an EINTR return
> is safe via the D_OPEN_RETURNS_EINTR cb_ops(9S) cb_flag or
> ddi-open-returns-eintr(9P) then a delayed open() may be
> interrupted by a signal, resulting in an EINTR return from
> open() prior to calling open(9E).
>
> Last-reference accounting and open() close() exclusion typically
> simplify driver writing; however, in some cases they may be
> an impediment for certain types of drivers. To overcome any
> impediment the driver can change minor numbers in open(9E)
> or implement multiple minor nodes for the same device -
> both techniques give the driver control over when close()
> calls will occur and whether additional open() calls will
> be delayed while close() is executing.
In general, a close() routine should always check the
validity of the minor number component of the dev
parameter. The routine should also check permissions as
necessary, by using the user credential structure (if
pertinent), and the appropriateness of the flag and otyp
parameter values.
---%<---
SEE ALSO
> ddi-open-returns-eintr(9P), cb_ops(9S)
---%<---
A.3 cb_ops(9S) man page change:
If the driver properly handles 64-bit offsets, it should
also set the D_64BIT flag in the cb_flag field. This
specifies that the driver will use the uio_loffset field of the
uio(9S) structure.
+ If the driver returns EINTR from open(9E), it should also set the
+ D_OPEN_RETURNS_EINTR flag in the cb_flag field. This lets the
+ framework know that it is safe for it to return EINTR when
+ waiting, to provide exclusion, for a last-reference close(9E)
+ call to complete before calling open(9E).
+
mt-streams(9F) describes other flags that can be set in the
cb_flag field.
cb_rev is the cb_ops structure revision number. This field
must be set to CB_REV.
A.4 ddi-open-returns-eintr.9p man page:
Kernel Properties for Drivers ddi-open-returns-eintr(9P)
NAME
ddi-open-returns-eintr - property indicates that device open can
safely return EINTR.
DESCRIPTION
When ddi-open-returns-eintr is set the kernel knows that an EINTR
return from open(9E) is an expected result. This allows the
kernel, in its implementation of open/close exclusion, to be
interruptible and fail an open with EINTR when an active close(9E)
operation, at (dev_t, spectype) granularity, is preventing a new
open(9E).
Set this property via driver.conf(4) if the open(9E) implementation
returns EINTR, especially when waiting for an active close(9E)
operation. When the property is set, kernel behavior is identical to
when the D_OPEN_RETURNS_EINTR cb_ops(9S) cb_flag is set.
SEE ALSO
open(9E), close(9E), cb_ops(9S)
Writing Device Drivers