I am sponsoring the following fasttrack for myself, requesting patch
binding and a timeout of 06/06/2007.



Template Version: @(#)sac_nextcase 1.61 05/24/07 SMI
This information is Copyright 2007 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Driver open-close exclusion guarantees

    1.2. Name of Document Author/Supplier:
         Author:  Chris Horne, Chris Gerhard

    1.3  Date of This Document:
        29 May, 2007

4. Technical Description:
    4.1. Summary:

        This proposal clarifies the interaction between the kernel and
        a non-streams driver's open(9E) and close(9E) implementations.
        The proposal focuses on two aspects of this interaction: the
        execution exclusion guarantees between open(9E) and close(9E)
        calls, and the last-reference accounting associated with
        close(9E) calls.

        The proposal includes open(9E), close(9E), and cb_ops(9S) man
        page changes, as well as a new man page for
        ddi-open-returns-eintr(9P).

        The proposal requests patch binding.

    4.1.1. Problems:

        UNIX has always used an open-close model where each device open
        results in an open(9E) call, and the last-reference close
        results in a single close(9E) call. While this basic model is
        simple and well understood, what this means for exclusion
        guarantees between open(9E) and close(9E) in a multi-threaded
        preemptive kernel environment like Solaris is not documented.

        UNIX last-reference accounting associated with a close(9E) call
        historically counted successfully completed open(9E) calls as
        'open'.  This works well in a single-threaded non-preemptive
        kernel environment, but it does not work well for Solaris.
        Solaris last-reference accounting has always treated
        in-progress open(9E) calls as 'open', but this is not clearly
        documented.

        Without a clear definition of both the exclusion guarantees and
        last-reference accounting, it is difficult to write a reliable
        driver.

    4.1.2. Proposal:

        This proposal defines, for non-streams drivers, execution
        exclusion guarantees between open(9E) and close(9E) calls, and
        last-reference accounting associated with close(9E) calls.

    4.1.2.1 Exclusion:

        To provide open-close exclusion in a multi-threaded preemptive
        kernel environment like Solaris, an executing close(9E) call
        must act as a barrier to all subsequent open(9E) calls: the
        last-reference close(9E) call needs to return before the next
        open(9E) call is allowed to start.

        Today the kernel implements open-close exclusion for streams
        drivers, but not for non-streams drivers. Non-streams drivers
        either incorrectly assume exclusion or are complicated by
        needing to implement their own exclusion.

        When exclusion for non-streams drivers is implemented, in
        situations where an active close(9E) call is preventing a new
        open(9E) call due to exclusion, having the framework always
        treat the waiting-open as interruptible is unsafe -
        applications may not be coded to expect a new EINTR return from
        open. This proposal provides new interfaces that allow the
        framework to determine if a waiting-open is safely
        interruptible.
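
        The decision described above can be sketched as a small
        user-space model. This is only an illustration under stated
        assumptions, not the framework implementation: open_gate(),
        excl_state_t, and the OPEN_* names are hypothetical, and
        eintr_safe stands for "the driver set D_OPEN_RETURNS_EINTR or
        ddi-open-returns-eintr(9P)".

```c
#include <assert.h>

/* Hypothetical outcomes of the framework's check before open(9E). */
typedef enum {
	OPEN_PROCEED,		/* no close(9E) active: call open(9E) now */
	OPEN_WAIT,		/* close(9E) active: keep waiting */
	OPEN_FAIL_EINTR		/* close(9E) active and the wait was interrupted */
} open_gate_t;

/* Model state for one exclusion domain (hypothetical type). */
typedef struct {
	int close_active;	/* a last-reference close(9E) is executing */
	int eintr_safe;		/* driver declared EINTR from open as safe */
} excl_state_t;

/*
 * Decide what a new open does.  A pending signal only matters when
 * the driver has declared that EINTR is an expected result of open.
 */
static open_gate_t
open_gate(const excl_state_t *st, int signal_pending)
{
	if (!st->close_active)
		return (OPEN_PROCEED);
	if (st->eintr_safe && signal_pending)
		return (OPEN_FAIL_EINTR);
	return (OPEN_WAIT);
}
```

        In this model a driver that has not declared EINTR as safe
        always yields OPEN_WAIT while close(9E) is active, so existing
        applications never see a new EINTR return from open.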

        Exclusion is provided at (dev_t, otyp) granularity, where dev_t
        and otyp refer to open(9E) arguments. The otyp values of
        interest are OTYP_BLK and OTYP_CHR. If this granularity is too
        fine-grained, the driver writer is left having to implement his
        own exclusion and accounting (often at ddi_get_instance(9F)
        granularity). Providing exclusion guarantees at instance
        granularity is outside the scope of this proposal.
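
        To make the granularity concrete, the following user-space
        sketch keys the "close in progress" state on the (dev_t, otyp)
        pair. The types and names (model_dev_t, MODEL_OTYP_*,
        excl_domain_t, open_must_wait()) are stand-ins invented for
        the illustration and their values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for dev_t and the otyp values; the numbers are arbitrary. */
typedef unsigned long model_dev_t;
enum { MODEL_OTYP_BLK, MODEL_OTYP_CHR };

/* One exclusion domain: a (dev, otyp) pair and its close state. */
typedef struct {
	model_dev_t dev;
	int otyp;
	int close_active;	/* last-reference close is executing */
} excl_domain_t;

/* Return nonzero if a close is active for exactly this (dev, otyp). */
static int
open_must_wait(const excl_domain_t *tbl, size_t n, model_dev_t dev, int otyp)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].dev == dev && tbl[i].otyp == otyp)
			return (tbl[i].close_active);
	}
	return (0);	/* unknown domain: nothing to wait for */
}
```

        A close that is active for one pair does not delay an open of
        the same dev_t with a different otyp, nor an open of a
        different minor of the same instance. That is why a driver
        needing instance-wide exclusion still has to implement it
        itself.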

    4.1.2.2 Last-reference accounting:

        Last-reference accounting occurs at the same (dev_t, otyp)
        granularity as exclusion. Solaris last-reference accounting has
        always treated in-progress open(9E) calls as 'open', but this
        is not clearly documented.

        No change to last-reference accounting is proposed; however, an
        explanation of how accounting is implemented is necessary,
        especially for implementing 'special behaviors' where the
        driver open(9E) and close(9E) implementations interact.
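
        The accounting rule can be illustrated with a simplified
        user-space model. It is a sketch, not the specfs code: devref_t
        and the function names are hypothetical, and the need_close
        field is an assumption used to express "an open succeeded and
        close(9E) is still owed".

```c
#include <assert.h>

/* Model of one (dev_t, otyp) reference count (hypothetical type). */
typedef struct {
	int refs;		/* in-progress plus completed opens */
	int need_close;		/* an open succeeded since the last close(9E) */
	int driver_closes;	/* times the model "called" close(9E) */
} devref_t;

/* Drop one reference; the last one going away triggers close(9E). */
static void
ref_drop(devref_t *d)
{
	if (--d->refs == 0 && d->need_close) {
		d->need_close = 0;
		d->driver_closes++;
	}
}

/* The reference is taken when open(9E) is entered, not when it returns. */
static void
open_enter(devref_t *d)
{
	d->refs++;
}

/* A failed open gives its reference back; a successful one keeps it. */
static void
open_return(devref_t *d, int succeeded)
{
	if (succeeded)
		d->need_close = 1;
	else
		ref_drop(d);
}

/* A close(2) (or exit, and so on) releases one completed open. */
static void
user_close(devref_t *d)
{
	ref_drop(d);
}
```

        With one completed open and a second open still in progress, a
        user_close() does not reach the driver. If the in-progress open
        then fails, it drops the last reference and the model issues the
        close(9E) call from the open path.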

    4.1.2.3 Special Behaviors:

        Understanding exclusion guarantees and last-reference
        accounting typically simplifies driver writing. However, for some
        behaviors additional guidance is still needed. Implementing
        these behaviors involves 'self-clone', where a driver changes
        the *devp value passed to open(9E). A driver that does a
        self-clone does not necessarily need to call
        ddi_create_minor_node(9F) for the new *devp value.
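
        A self-cloning driver has to track which minor numbers it has
        handed out. The following user-space sketch shows one possible
        bookkeeping scheme; minor_pool_t, minor_alloc(), minor_free(),
        and the MODEL_* constants are hypothetical. It skips the minor
        created in attach(9E), following the open(9E) man page advice
        in this proposal to avoid returning the original minor.

```c
#include <assert.h>

#define	MODEL_NMINORS		8	/* arbitrary pool size for the sketch */
#define	MODEL_CLONE_MINOR	0	/* the one minor created in attach(9E) */

/* Which minors are currently handed out (hypothetical type). */
typedef struct {
	unsigned char in_use[MODEL_NMINORS];
} minor_pool_t;

/* Return a free minor other than the clone minor, or -1 if none. */
static int
minor_alloc(minor_pool_t *p)
{
	int m;

	for (m = 0; m < MODEL_NMINORS; m++) {
		if (m == MODEL_CLONE_MINOR || p->in_use[m])
			continue;
		p->in_use[m] = 1;
		return (m);
	}
	return (-1);
}

/* Give a minor back, typically from the matching close(9E). */
static void
minor_free(minor_pool_t *p, int m)
{
	if (m >= 0 && m < MODEL_NMINORS && m != MODEL_CLONE_MINOR)
		p->in_use[m] = 0;
}
```

        In a real driver the value returned by such an allocator would
        be stored into *devp with makedevice(9F) and getmajor(9F), as
        the open(9E) man page describes.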

        o A driver that supports O_NDELAY (FNDELAY) and blocks in
          open(9E) or close(9E) for an event that takes a long time (or
          may never occur) must use separate minor nodes for O_NDELAY
          and non-O_NDELAY access for the applications to get real
          O_NDELAY behavior. Applications using the device must either
          match the minor node used with their O_NDELAY flag use, or
          the driver must self-clone to match O_NDELAY flag use.

          This guidance is related to both exclusion and last-reference
          accounting. For exclusion, this guidance prevents a new
          O_NDELAY open from waiting on completion of a non-O_NDELAY
          close(9E). For last-reference accounting, this guidance
          allows an O_NDELAY close(9E) to occur while there is a
          blocked non-O_NDELAY open(9E) call.

          This is already a de facto Solaris requirement: an example is
          the OUTLINE implementation used by serial communications
          drivers like zs(7D).

          In this situation, Solaris-specific DDI considerations
          influence how a driver must implement a POSIX-compliant
          O_NDELAY open(9E). An unmodified SVR4 driver's O_NDELAY
          open(9E) implementation may not be POSIX compliant under
          Solaris.

          NOTE: Some drivers (such as sd(7D)) use O_NDELAY to support
          administrative commands which need to open the device prior
          to full device initialization. These drivers fail their
          non-O_NDELAY open(9E) instead of blocking, so they do not
          need to use separate minor nodes.

        o A driver that blocks in open(9E) for an event signaled from
          close(9E) must self-clone.

          This guidance is related to last-reference accounting. If not
          followed, the close(9E) call will never occur since an
          in-progress open(9E) call counts as an 'open'.

          This is already a de facto Solaris requirement: an example is
          a queuing exclusive use device, like a printer. Originally,
          UNIX printer drivers slept in open(9E) if the device was
          already in use.  This provided a driver-based queuing
          system.

        o A driver that blocks in close(9E) for an event that takes a
          long time (or may never occur) is preventing subsequent
          open(9E) operations. While blocking in close(9E) is not
          prohibited, the driver writer needs to understand the
          ramifications, possibly setting the D_OPEN_RETURNS_EINTR
          cb_ops(9S) flag or setting ddi-open-returns-eintr(9P) in
          driver.conf(4) if it is safe to return EINTR from open.

          This guidance is related to exclusion guarantees.

          This is already a de facto Solaris requirement for streams:
          an example is maximum drain times on close for streams. The
          ramifications of blocking indefinitely in close are not new
          for streams since streams currently has exclusion.
          Applications opening streams already expect EINTR, so the
          waiting-open can be interruptible.

        In the situations above, implementing multiple minor nodes or
        doing a 'self-clone' expands the operation beyond the typical
        (dev_t, otyp) granularity, so exclusion and last-reference
        accounting are no longer an impediment to implementing atypical
        behaviors.
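
        As an illustration of the first case above, a driver that
        self-clones to match O_NDELAY use could derive the target minor
        from the open flags. The sketch below is a user-space model
        under stated assumptions: select_minor(), MODEL_FNDELAY, and
        MODEL_NDELAY_BIT are hypothetical, and the bit values are
        arbitrary rather than taken from any existing driver.

```c
#include <assert.h>

#define	MODEL_FNDELAY		0x04	/* stand-in for FNDELAY; arbitrary value */
#define	MODEL_NDELAY_BIT	0x80	/* hypothetical minor bit for O_NDELAY use */

/*
 * Pick the minor an open should land on.  O_NDELAY opens and
 * non-O_NDELAY opens of the same unit get different minors, so they
 * fall into different (dev_t, otyp) domains.
 */
static unsigned int
select_minor(unsigned int minor, int flag)
{
	if (flag & MODEL_FNDELAY)
		return (minor | MODEL_NDELAY_BIT);
	return (minor & ~(unsigned int)MODEL_NDELAY_BIT);
}
```

        Because the two kinds of open end up on different minors, an
        O_NDELAY open is not delayed by a non-O_NDELAY close(9E), and
        an O_NDELAY close(9E) can occur while a non-O_NDELAY open(9E)
        is still blocked.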

    4.1.2.4 Legacy non-DDI compliant interface issues:

        The Solaris open(9E) close(9E) exclusion guarantee is annulled
        when kernel software, other than specfs, uses the following
        private non-DDI interfaces: dev_open(), dev_close(), cb_ops(9S)
        cb_open, or cb_ops(9S) cb_close.  If these private non-DDI
        interfaces are used, no new problems occur, but consumers
        should switch to use the Layered Driver Interfaces (LDI, PSARC
        2001/769).  LDI provides a DDI compliant way to perform these
        operations which does not annul exclusion guarantees.

    4.2. Bug/RFE Number(s):

        6343604 specfs race: multiple "last-close" of the same device
        4127807 DDI:  Is there a race between open(9e) and close(9e)?
    
    4.5. Interfaces:

        ------------------------------------------------------------------------
        Interface                       Level                   Comments
        ------------------------------------------------------------------------

        Existing:
          open(9E)                      Committed       Define exclusion and
          close(9E)                     "               last-reference behavior.

        New:
          D_OPEN_RETURNS_EINTR          "               cb_ops(9S) cb_flag:
                                                        Driver returns and
                                                        applications expect
                                                        EINTR from device open.

          ddi-open-returns-eintr(9P)    "               driver.conf(4) property:
                                                        Driver returns and
                                                        applications expect
                                                        EINTR from device open.

    
6. Resources and Schedule:
   6.4. Product Approval Committee requested information:
        6.4.1. Consolidation or Component Name:

        ON

   6.5. ARC review type:

        FastTrack



A. Man page changes

A.1 open(9E) man page changes:
  
    Driver Entry Points                                      open(9E)
    
    NAME
         open - gain access to a device
    
    SYNOPSIS
      Block and Character
         #include <sys/types.h>
         #include <sys/file.h>
         #include <sys/errno.h>
         #include <sys/open.h>
         #include <sys/cred.h>
         #include <sys/ddi.h>
         #include <sys/sunddi.h>
    
         int prefixopen(dev_t  *devp,  int  flag,  int  otyp,  cred_t
         *cred_p);
    
      STREAMS
         #include <sys/file.h>
         #include <sys/stream.h>
         #include <sys/ddi.h>
         #include <sys/sunddi.h>
    
         int prefixopen(queue_t  *q,  dev_t  *devp,  int  oflag,  int
         sflag, cred_t *cred_p);
    
  ---%<---
    
    DESCRIPTION
         The driver's  open() routine is called by the kernel  during
         an   open(2) or a  mount(2) on the special file for the
  >      device.  A device may be opened simultaneously by multiple
  >      processes and  the  open() driver routine is called for each
  >      open.  Note that a device is referenced once its associated
  >      open(9E) routine is entered, and thus open(9E) calls that have
  >      not yet completed will prevent close(9E) from being called.
  >
  |      The routine should verify that the minor  number
         component of *devp is valid, that the type of access requested
         by  otyp and  flag is appropriate for the  device,  and,  if
         required,  check  permissions  using  the  user  credentials
         pointed to by  cred_p.

  >      The kernel provides open() close() exclusion guarantees to the
  >      driver at (*devp, otyp) granularity.  This delays new open()
  >      calls to the driver while a last-reference close() call is
  >      executing.  If the driver has indicated that an EINTR return
  >      is safe via the D_OPEN_RETURNS_EINTR cb_ops(9S) cb_flag or
  >      ddi-open-returns-eintr(9P) then a delayed open() may be
  >      interrupted by a signal, resulting in an EINTR return.
  >
  >      Last-reference accounting and open() close() exclusion
  >      typically simplify driver writing; however, in some cases they
  >      may be an impediment for certain types of drivers. To overcome
  >      any impediment the driver can change minor numbers in open(9E),
  >      as described below, or implement multiple minor nodes for the
  >      same device - both techniques give the driver control over
  >      when close() calls will occur and whether additional open()
  >      calls will be delayed while close() is executing.

         The open() routine is passed a pointer to a device number so
         that  the  driver  can  change the minor number. This allows
         drivers to dynamically  create minor instances of  the  dev-
         ice.   An example of this might be a  pseudo-terminal driver
         that creates a new pseudo-terminal whenever it   is  opened.
         A driver that chooses the minor number dynamically, normally
         creates only one  minor  device  node  in   attach(9E)  with
         ddi_create_minor_node(9F) then changes the minor number com-
         ponent of *devp using makedevice(9F)  and  getmajor(9F).
         The driver needs to keep track of available minor numbers
  >      internally.  A driver that dynamically creates minor
  >      numbers may want to avoid returning the original minor
  >      number, since returning the original minor will cause
  >      dynamic opens to be postponed while a close() call for the
  >      original minor is executing.

  ---%<---

    SEE ALSO
  >      ddi-open-returns-eintr(9P), cb_ops(9S)


  ---%<---


A.2 close(9E) man page changes:
  
    Driver Entry Points                                     close(9E)
    
    NAME
         close - relinquish access to a device
    
    SYNOPSIS
      Block and Character
         #include <sys/types.h>
         #include <sys/file.h>
         #include <sys/errno.h>
         #include <sys/open.h>
         #include <sys/cred.h>
         #include <sys/ddi.h>
         #include <sys/sunddi.h>
    
         int  prefixclose(dev_t  dev,  int  flag,  int  otyp,  cred_t
         *cred_p);
    
  ---%<---
    
    DESCRIPTION
         For STREAMS drivers, the  close() routine is called  by  the
         kernel  through  the  cb_ops(9S) table entry for the device.
         (Modules use the  fmodsw table.) A  non-null  value  in  the
         d_str  field  of  the   cb_ops  entry points to a  streamtab
         structure, which points to a qinit(9S) containing a  pointer
         to  the   close() routine. Non-STREAMS  close() routines are
         called directly from the  cb_ops table.
    
         close() ends the connection between the user process and the
         device,  and  prepares the device (hardware and software) so
         that it is ready to be opened again.
    
  <      A device may be opened simultaneously by multiple  processes
  <      and  the  open() driver routine is called for each open, but
  <      the kernel will only call the  close() routine when the last
  <      process  using  the  device issues a  close(2) or  umount(2)
  <      system call or exits. (An exception is  a  close  occurring
  <      with  the  otyp argument set to  OTYP_LYR, for which a close
  <      (also having otyp = OTYP_LYR) occurs for each open.)
  
  >      A device may be opened simultaneously by multiple  processes
  >      and  the  open() driver routine is called for each open.
  >      For all otyp values other than OTYP_LYR the kernel calls
  >      the close() routine when the last-reference occurs. For
  >      OTYP_LYR each close operation will call the driver.
  >
  >      Kernel accounting for last-reference occurs at (dev, otyp)
  >      granularity.  Note that a device is referenced once its
  >      associated open(9E) routine is entered, and thus open(9E) calls
  >      that have not yet completed will prevent close(9E) from
  >      being called.  The driver close(9E) call associated with the
  >      last-reference going away is typically issued as a result
  >      of a close(2), exit(2), munmap(2), or umount(2). However, a
  >      failed open(9E) call can cause this last-reference close(9E)
  >      call to be issued as a result of an open(2) or mount(2).
  >
  >      The kernel provides open() close() exclusion guarantees
  >      to the driver at the same (dev, otyp) granularity as
  >      last-reference accounting. The kernel delays new calls to the
  >      open() driver routine while the last-reference close() call is
  >      executing - a driver that blocks in close() will not see new
  >      calls to open() until it returns from close().  This
  >      effectively delays invocation of other cb_ops(9S) driver entry
  >      points that depend on an open(9E) established device reference
  >      too. If the driver has indicated that an EINTR return
  >      is safe via the D_OPEN_RETURNS_EINTR cb_ops(9S) cb_flag or
  >      ddi-open-returns-eintr(9P) then a delayed open() may be
  >      interrupted by a signal, resulting in an EINTR return from
  >      open() prior to calling open(9E).
  >
  >      Last-reference accounting and open() close() exclusion typically
  >      simplify driver writing; however, in some cases they may be
  >      an impediment for certain types of drivers. To overcome any
  >      impediment the driver can change minor numbers in open(9E)
  >      or implement multiple minor nodes for the same device -
  >      both techniques give the driver control over when close()
  >      calls will occur and whether additional open() calls will
  >      be delayed while close() is executing.

         In general, a  close() routine should always check the
         validity  of  the  minor number component of the  dev
         parameter.  The routine should also check permissions as
         necessary,  by using  the user credential structure (if
         pertinent), and the appropriateness of the  flag and  otyp
         parameter values.

  ---%<---

    SEE ALSO
  >      ddi-open-returns-eintr(9P), cb_ops(9S)

  ---%<---


A.3 cb_ops(9S) man page change:

      If the driver properly handles  64-bit  offsets,  it  should
      also  set the D_64BIT flag in the cb_flag field. This speci-
      fies that the driver will use the uio_loffset field  of  the
      uio(9S) structure.
 
+     If the driver returns EINTR from open(9E), it should also set the
+     D_OPEN_RETURNS_EINTR flag in the cb_flag field.  This lets the
+     framework know that it is safe for it to return EINTR when
+     waiting, to provide exclusion, for a last-reference close(9E)
+     call to complete before calling open(9E).
+
      mt-streams(9F) describes other flags that can be set in  the
      cb_flag field.
 
      cb_rev is the cb_ops structure revision number.  This  field
      must be set to CB_REV.


A.4 ddi-open-returns-eintr.9p man page:
  
  Kernel Properties for Drivers          ddi-open-returns-eintr(9P)
  
  NAME
       ddi-open-returns-eintr - property indicates that device open can
       safely return EINTR.
  
  DESCRIPTION
       When ddi-open-returns-eintr is set the kernel knows that an EINTR
       return from open(9E) is an expected result.  This allows the
       kernel, in its implementation of open/close exclusion, to be
       interruptible and fail an open with EINTR when an active close(9E)
       operation, at (dev_t, spectype) granularity, is preventing a new
       open(9E).
  
       Set this property via driver.conf(4) if the driver's open(9E)
       implementation returns EINTR, especially when waiting for an
       active close(9E) operation.  When the property is set, kernel
       behavior is identical to when the D_OPEN_RETURNS_EINTR cb_ops(9S)
       cb_flag is set.
  
  
  SEE ALSO
       open(9E), close(9E), cb_ops(9S)
  
       Writing Device Drivers
