I am sponsoring the following fasttrack for Shidokht Yadegari.  It
introduces a new 'extended' VTOC API for storage in the 1-2TB capacity
range.  The new API supports Solaris disk labeling in a way that is
compatible with current OBP boot code on sparc and is compatible with
current multi-OS MBR fdisk practice on x86.  The proposed release
binding is micro/patch.  Additional referenced material is in the case
directory.

-Chris


Template Version: @(#)sac_nextcase 1.66 04/17/08 SMI
This information is Copyright 2008 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         Extended VTOC
    1.2. Name of Document Author/Supplier:
         Author:  Shidokht Yadegari
    1.3  Date of This Document:
        20 May, 2008
4. Technical Description
4.1 Introduction

    While a 64-bit Solaris kernel currently supports EFI (Extensible
    Firmware Interface - Intel) GPT (GUID Partition Table) labeled
    disks larger than 1TB [1, 7], current boot code does not support
    GPT.  The current Sun disk label limits the size of a bootable disk
    to less than 1TB.

    A Sun disk label separates private on-disk data from the public
    API. To support booting from disks in the 1TB-2TB size range while
    the long term GPT boot solution is being resolved, this case
    proposes taking advantage of the private/public separation. The
    case proposes a minimal compatible modification to the private
    on-disk 'struct dk_label' from SunOS_4.x [8] and a compatible
    extension to the public VTOC API.

    Even if GPT boot support were available for Solaris today, there is
    still a need for Solaris to support the proposed extended VTOC on x86
    machines since Windows and Linux depend on using non-GPT MBR fdisk
    for disks less than 2TB (single disk with separate fdisk partitions
    that boot different operating systems).

    There is no current support for disks over 1TB on 32-bit kernels
    [1]. This project does not propose any change in this respect.

    This project requests patch release binding.

4.2 Solution Overview

    The current 1TB 'dk_label' and VTOC limits originate from using
    signed 32-bit values for the start and size of a slice.
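
    As a rough illustration of where these limits come from, the
    sketch below computes the capacity addressable by a signed and by
    an unsigned 32-bit block count. It assumes 512-byte sectors, and
    the function names are illustrative only; neither is taken from
    the case materials.

```c
#include <stdint.h>

#define	SECTOR_SIZE	512ULL	/* assumption: 512-byte sectors */

/* Bytes addressable when block counts are signed 32-bit values. */
static uint64_t
signed32_limit_bytes(void)
{
	return (((uint64_t)INT32_MAX + 1) * SECTOR_SIZE);
}

/* Bytes addressable when block counts are unsigned 32-bit values. */
static uint64_t
unsigned32_limit_bytes(void)
{
	return (((uint64_t)UINT32_MAX + 1) * SECTOR_SIZE);
}

/*
 * Reinterpret a 32-bit field that was declared signed as an unsigned
 * block address, widening it to 64 bits without sign extension.
 */
static uint64_t
block_addr_unsigned(int32_t on_disk)
{
	return ((uint64_t)(uint32_t)on_disk);
}
```

    Under the 512-byte assumption the first function yields 2^40 bytes
    (1TB) and the second 2^41 bytes (2TB), which matches the ranges
    discussed in this case.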

    This case changes the interpretation of the private on-disk values,
    and controls exposure of these changes at the public API level to
    quickly provide support for boot devices in the 1-2TB size range.

    o For the on-disk 'dk_label', block address signed fields are
      treated as unsigned. To prevent breakage of sparc OBP boot code,
      which already treats block addresses as unsigned, no revision of
      the on-disk format is proposed.

    o For the VTOC API, the Solaris kernel code detects a 1TB-2TB disk
      based on disk capacity and enforces correct API access. This
      proposal covers:

      * A new 'extvtoc' structure, where the extended VTOC is identical
        to the current VTOC with the following exceptions:

        o Extended VTOC uses the size-invariant 64-bit 'diskaddr_t' to
          represent disk block addresses.

        o Extended VTOC fields are defined using size-invariant types
          in order to reduce the 32/64 ioctl data-model mapping.

        o NOTE: While the new 'extvtoc' API can support sizes much
          larger than 2TB, on-disk dk_label limitations limit capacity
          to 2TB.  A device that is larger than 2TB can be supported in
          a limited fashion by reducing the dk_label represented size
          to 2TB.

      * New ioctls for the 'extvtoc'. The disk capacity establishes the
        VTOC mode of a disk: current (<1TB) or extended (>1TB).
        Extended ioctls work in both modes.

      * Behavior of current VTOC API for >1TB disks.

    The associated changes will affect

    * Disk target driver

    * Interfaces: ioctls/libadm

    * Utilities/Install

    as described below.

4.2.1 Disk Target Drivers 

    The project proposes the following changes to target drivers
    (mostly sd/ssd/cmdk drivers):

    * Move the cutoff limit for VTOC labeling to 2TB.

    * Support having a limited-to-2TB VTOC (and legacy MBR on x86) on
      disks larger than 2TB.

    * Allow the x86 only whole disk node /dev/rdsk/cXtYdZp0 to be
      always accessible regardless of disk content.

4.2.2 Interfaces: ioctls/libadm

4.2.2.1 Data Structures and ioctl Behavior:

4.2.2.1.1 Current Limitations 

    For the current vtoc, the following fields are associated with size
    limitations:

        struct vtoc {
                struct partition v_part[]{
                        daddr_t p_start; 
                        long    p_size;  
                }
        }
        struct part_info {
                daddr_t    p_start;
                int        p_length;
        }
        struct dk_allmap {
                struct dk_map dka_map[] {
                        daddr_t dkl_nblk; 
                }
        }

    Since 'daddr_t' is a long, we have a 1TB limit for a 32-bit
    application. Also, 'p_length' implies a 1TB limit even in 64-bit
    applications.

        DKIOCSVTOC              struct vtoc
        DKIOCGVTOC              struct vtoc
        DKIOCPARTINFO           struct part_info
        read_vtoc               struct vtoc
        write_vtoc              struct vtoc
        DKIOCGAPART             struct dk_allmap
        DKIOCSAPART             struct dk_allmap

    The above ioctls/library functions (plus the geometry related
    ioctls) do not allow partitioning a disk > 1TB with current VTOC
    API.

4.2.2.1.2 Proposed Solution

    Introduce new ioctls, library functions and data structures (except
    for DKIOCGAPART and DKIOCSAPART, explanation further down).

        DKIOCSEXTVTOC           struct extvtoc
        DKIOCGEXTVTOC           struct extvtoc
        DKIOCEXTPARTINFO        struct extpart_info (x86 only)
        read_extvtoc()          struct extvtoc
        write_extvtoc()         struct extvtoc

    Among possible alternative solutions, redefining struct vtoc with
    unsigned values for start/size was considered, but was rejected in
    order to maintain source compatibility with any third-party
    applications consuming current interfaces.

4.2.2.1.2.1 Proposed Solution: Header file changes: vtoc.h

    Add the following new structures, defines, and interfaces - using
    32/64-bit size-invariant data types and 'diskaddr_t' for disk
    addresses: (NOTE: 'ext' comments come from current VTOC API
    wording)

    struct extpartition {
        ushort_t p_tag;                 /* ID tag of partition */
        ushort_t p_flag;                /* permission flags */
        ushort_t p_pad[2];
        diskaddr_t p_start;             /* start sector no of partition */
        diskaddr_t p_size;              /* # of blocks in partition */
    };

    struct extvtoc {
        uint64_t        v_bootinfo[3];  /* info needed by mboot (unsupported) */
        uint64_t        v_sanity;       /* to verify vtoc sanity */
        uint64_t        v_version;      /* layout version */
        char            v_volume[LEN_DKL_VVOL]; /* volume name */
        ushort_t        v_sectorsz;     /* sector size in bytes */
        ushort_t        v_nparts;       /* number of partitions */
        ushort_t        pad[2];
        uint64_t        v_reserved[10];
        struct extpartition v_part[V_NUMPAR]; /* partition headers */
        uint64_t timestamp[V_NUMPAR];   /* partition timestamp (unsupported) */
        char    v_asciilabel[LEN_DKL_ASCII];    /* for compatibility */
    };

    #define V_EXTVERSION        V_VERSION /* extvtoc layout version number */
    #define     VT_EOVERFLOW    (-7)      /* VTOC op. data struct limited */

    extern      int     read_extvtoc(int, struct extvtoc *);
    extern      int     write_extvtoc(int, struct extvtoc *);
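
    To illustrate why size-invariant types reduce the 32/64 ioctl
    data-model mapping, the sketch below restates 'struct
    extpartition' with local stand-in typedefs (an assumption, so that
    it builds without the Solaris headers) and checks at compile time
    that its layout does not depend on the width of 'long'.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the Solaris typedefs; assumptions for this sketch. */
typedef unsigned short	ushort_t;
typedef uint64_t	diskaddr_t;	/* 64-bit in both data models */

struct extpartition {
	ushort_t	p_tag;		/* ID tag of partition */
	ushort_t	p_flag;		/* permission flags */
	ushort_t	p_pad[2];
	diskaddr_t	p_start;	/* start sector no of partition */
	diskaddr_t	p_size;		/* # of blocks in partition */
};

/* These hold whether the file is compiled ILP32 or LP64. */
_Static_assert(sizeof (struct extpartition) == 24,
    "extpartition size is data-model invariant");
_Static_assert(offsetof(struct extpartition, p_start) == 8,
    "p_start follows the 8 bytes of tag, flag and pad");
_Static_assert(offsetof(struct extpartition, p_size) == 16,
    "p_size follows p_start");
```

    Because the three leading 'ushort_t' members plus the pad occupy
    exactly 8 bytes, the 64-bit members are naturally aligned and no
    compiler-inserted padding is needed in either model.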


4.2.2.1.2.2 Proposed Solution: Header file changes: dkio.h

    The following new ioctls and data structures are defined:

    #define     DKIOCGEXTVTOC   (DKIOC|23)      
    #define     DKIOCSEXTVTOC   (DKIOC|24)

    #define     DKIOCEXTPARTINFO (DKIOC|19)

    struct extpart_info {
        diskaddr_t      p_start;
        diskaddr_t      p_length;
    };

    Proposed updates to dkio(7I) and read_vtoc(3EXT) man pages are in
    materials directory.

4.2.2.1.2.3 Proposed Solution: Code Change Overview

    Main changes when operating on >1TB disks:

    * DKIOCSVTOC and DKIOCGVTOC (unless disk is GPT labeled) will
      return EOVERFLOW instead of ENOTSUP. For read_vtoc and write_vtoc
      a new return value, VT_EOVERFLOW, is defined, to be returned when
      DKIOCGVTOC/DKIOCSVTOC return EOVERFLOW. This is mainly a
      notification mechanism for applications to use new interfaces
      instead.

    * DKIOCGGEOM/DKIOCSGEOM and DKIOCG_PHYGEOM will be supported.

    * DKIOCPARTINFO will return EOVERFLOW [3]. Application should use
      DKIOCEXTPARTINFO instead.

    Please see ioctl_behavior_matrix.txt in materials directory for
    specific details on ioctls behavior.
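
    The overflow condition behind the DKIOCPARTINFO change can be
    sketched as a narrowing check from the extended structure to the
    legacy one. The helper name 'extpart_to_part' is hypothetical and
    the structures are restated locally; the authoritative rules are
    those in ioctl_behavior_matrix.txt, not this sketch.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

typedef uint64_t	diskaddr_t;	/* stand-in for the Solaris typedef */

struct extpart_info {
	diskaddr_t	p_start;
	diskaddr_t	p_length;
};

/* Legacy layout: daddr_t is a long, p_length is an int. */
struct part_info {
	long	p_start;
	int	p_length;
};

/*
 * Hypothetical helper: copy an extended partition description into
 * the legacy structure. Returns 0 on success, or EOVERFLOW if either
 * value cannot be represented in the legacy field.
 */
static int
extpart_to_part(const struct extpart_info *ext, struct part_info *old)
{
	if (ext->p_start > (diskaddr_t)LONG_MAX ||
	    ext->p_length > (diskaddr_t)INT_MAX)
		return (EOVERFLOW);

	old->p_start = (long)ext->p_start;
	old->p_length = (int)ext->p_length;
	return (0);
}
```

    A caller that receives EOVERFLOW from the legacy path would switch
    to the extended interface rather than retry.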

    Rationale for not extending DKIOC(G/S)APART:

    * DKIOCSAPART was EOLed by PSARC/2001/570 [1], although EOL was not
      announced.

    * A combination of other ioctls (DKIOCGGEOM, DKIOCGVTOC/DKIOCSVTOC)
      can be used to achieve the same end results with some additional
      calculations.

    * These ioctls are very rarely used. Install does not utilize
      them. In Nevada ON gate, there are 3 consumers, 2 of which are
      for floppies; the third one is format (which we will change with
      this project).

    * DKIO ioctl numbers are precious, and we don't want to take more
      if we don't have to.

4.2.3 Utilities/Install

4.2.3.1 format(1M):

    format (-e) will allow labeling a disk with VTOC regardless of size
    of the disk. Also, since the cutoff for supporting VTOC in target
    drivers moved to 2TB, default label type for disks less than 2TB
    will be VTOC.

4.2.3.2 fmthard(1M):
 
    Since the cutoff for supporting VTOC in target drivers moved to
    2TB, default label type for disks less than 2TB will be VTOC.

    format should be used if the user wants to force a VTOC label onto
    a disk > 2TB that does not already have a VTOC label (or legacy MBR
    on x86).

4.2.3.3 fdisk(1M):

    Currently fdisk will fail on a disk > 1TB with no GPT label. fdisk
    also has issues with assuming signed 32-bit values instead of
    unsigned for MBR partition information which limits it to 1TB.

    The project proposes to allow fdisk to run on x86 regardless of
    contents on the disk. fdisk and consumers of fdisk will also be
    modified to use unsigned 32-bit values so we can handle up to 2TB.
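
    A minimal sketch of reading the MBR partition-entry block fields
    as unsigned values follows. It relies on the conventional MBR
    entry layout (16-byte entry, little-endian starting LBA at offset
    8 and sector count at offset 12); the macro and function names are
    illustrative and are not taken from the fdisk sources.

```c
#include <stdint.h>

#define	MBR_ENTRY_SIZE	16	/* bytes per partition table entry */
#define	MBR_RELSECT_OFF	8	/* starting LBA, 4 bytes little-endian */
#define	MBR_NUMSECT_OFF	12	/* sector count, 4 bytes little-endian */

/* Decode a 4-byte little-endian field as an unsigned 32-bit value. */
static uint32_t
le32_decode(const unsigned char *p)
{
	return ((uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24));
}

/*
 * Return the first sector past the end of the partition. The sum is
 * formed in 64 bits so that start + count cannot wrap around.
 */
static uint64_t
mbr_entry_end_sector(const unsigned char entry[MBR_ENTRY_SIZE])
{
	uint64_t start = le32_decode(entry + MBR_RELSECT_OFF);
	uint64_t count = le32_decode(entry + MBR_NUMSECT_OFF);

	return (start + count);
}
```

    Keeping the decoded values unsigned, and widening before any
    arithmetic, avoids the negative starts and sizes that appear when
    the same bytes are read into signed 32-bit variables.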

4.2.3.4 SVM/metainit(1M):

    SVM will be extended to create metadevices comprised of physical
    disks with VTOC labels up to 2 TB. Metadevices fabricate label
    information so that they can handle disk ioctl requests. SVM will
    be changed to handle the DKIOCGEXTVTOC and DKIOCSEXTVTOC ioctls for
    metadevices up to 2 TB.

    Current behavior of the ioctls with metadevices: 

    <= 1 TB 
    DKIOCSVTOC - will always succeed 
    DKIOCGVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed unless VTOC labeled (ENOTSUP)

    > 1 TB 
    DKIOCSVTOC - will always fail (ENOTSUP) 
    DKIOCGVTOC - will always fail (ENOTSUP) 
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed 

    Proposed behavior: 
    <= 1 TB 
    DKIOCSVTOC - will always succeed 
    DKIOCGVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSEXTVTOC - will always succeed 
    DKIOCGEXTVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed unless VTOC labeled (ENOTSUP)

    > 1TB < 2 TB 
    DKIOCSVTOC - will always fail (EOVERFLOW) 
    DKIOCGVTOC - will always fail (EOVERFLOW) unless EFI/GPT labeled (ENOTSUP)
    DKIOCSEXTVTOC - will always succeed 
    DKIOCGEXTVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed unless VTOC labeled (ENOTSUP)

    > 2 TB 
    DKIOCSVTOC - will always fail (EOVERFLOW) 
    DKIOCGVTOC - will always fail (EOVERFLOW) unless EFI/GPT labeled (ENOTSUP)
    DKIOCSEXTVTOC - will always fail (ENOTSUP) 
    DKIOCGEXTVTOC - will always fail (ENOTSUP) 
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed 

    There are no backward compatibility issues. 
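
    The proposed behavior of the two "get" ioctls on metadevices can
    be restated as a pair of decision functions. This is only a
    sketch: the function names are hypothetical, sizes are taken in
    bytes, and a device of exactly 2 TB is assumed to fall in the
    middle band, which the tables above do not state explicitly.

```c
#include <errno.h>
#include <stdint.h>

#define	ONE_TB	(1ULL << 40)
#define	TWO_TB	(2ULL << 40)

/* Expected result of DKIOCGVTOC: 0 on success, else an errno value. */
static int
get_vtoc_result(uint64_t size_bytes, int efi_labeled)
{
	if (efi_labeled)
		return (ENOTSUP);	/* EFI/GPT label, any size */
	if (size_bytes <= ONE_TB)
		return (0);
	return (EOVERFLOW);		/* caller should use extvtoc */
}

/* Expected result of DKIOCGEXTVTOC: 0 on success, else an errno. */
static int
get_extvtoc_result(uint64_t size_bytes, int efi_labeled)
{
	if (size_bytes > TWO_TB)
		return (ENOTSUP);	/* always fails above 2 TB */
	if (efi_labeled)
		return (ENOTSUP);
	return (0);
}
```

    Read this way, EOVERFLOW is the signal that an extended interface
    exists for the device, while ENOTSUP means no VTOC view is
    available at all.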

4.2.3.5 Install:

    Since the cutoff for supporting VTOC in target drivers moved to
    2TB, install will also use VTOC labels for disks up to 2TB.

    GPT label support for install is not added by this phase of the
    project; it will be addressed in a future phase (when we are
    planning to move to GPT labels by default for boot on disks >
    2TB).

4.3 Private Interfaces

4.3.1 dklabel.h, vtoc.h, altsctr.h, dadkio.h

    The on-disk VTOC label (dklabel.h) has the same issue with defining
    some structure fields as signed 32-bit values (which is the root of
    all 1TB limit VTOC limitations). In addition, structures defined
    for DIOCTL_RWCMD and alternate sector handling have the same
    issues.

    We will use larger data types for the problematic fields and enable
    them with a new preprocessor symbol test _EXTVTOC. This is done to
    keep source compatibility with drivers that are already using these
    structures and will not be modified for > 1 or 2 TB support (such
    as pcmcia disk support).

    The kernel macros vtoc32tovtoc and vtoctovtoc32 will be updated to
    take the larger data types into account.

4.4 Interface classification

Interface exported              Level                   Comments

struct extvtoc                  Evolving
struct extpart_info             Evolving                x86 only
DKIOCSEXTVTOC                   Evolving
DKIOCGEXTVTOC                   Evolving        
DKIOCEXTPARTINFO                Evolving                x86 only
read_extvtoc()                  Evolving
write_extvtoc()                 Evolving

4.5 Backwards Compatibility

    A disk > 1TB with a VTOC label extending beyond 1TB is not supported
    on prior releases; this will be documented in a release note.

    What happens if a >1TB disk with a VTOC label extending beyond 1TB is
    used in an older release? Here is what we have found experimentally
    on a SCSI disk and by code inspection:

    o X86: Current Nevada  and S10U5 (both 64-bit):

      o The VTOC label is not recognized in target driver.

      o Normal open (with no non-block flag) of all disk minor nodes
        will fail.

      o Format will assume there is no VTOC on the disk and use an
        EFI/GPT label to label the device.

    o SPARC: Current Nevada:

      o Warning message from target driver is issued.

      o Target driver recognizes and accepts the VTOC.

      o Normal open of disk minor nodes succeeds similar to <1TB VTOC.

      o Mount and reading an existing ufs file system of > 1TB
        succeed.

      o Format gets confused (e.g., a roughly 2TB disk is seen as a
        4.87 GB drive, auto-configure without the -e option fails...).

    o SPARC: S10U5:

      o Warning message from target driver is issued.

      o Target driver assumes the label is invalid.

      o Normal open of disk minor nodes fails.

      o Format -e does not recognize any label on the disk and gets
        confused (e.g., auto-configuration fails).

References:

    [1] PSARC 2001/570 Multi-terabyte disk support
        <http://sac.sfbay/PSARC/2001/570>
        <http://www.opensolaris.org/os/community/arc/caselog/2001/570>

    [2] VTOC changes
        <http://sac.sfbay/PSARC/1991/062>
        <http://www.opensolaris.org/os/community/arc/caselog/1991/062>

    [3] 6691817 DKIOCPARTINFO can overflow ...
        <http://monaco.sfbay.sun.com/detail.jsf?cr=6691817>
        <http://bugs.opensolaris.org/view_bug.do?bug_id=6691817>

    [4] x86 Disk Layout
        <http://sac.sfbay/PSARC/1993/015>
        <http://www.opensolaris.org/os/community/arc/caselog/1993/015>

    [5] Disk preparation, x86-specific 
        <http://sac.sfbay/PSARC/1993/019>
        <http://www.opensolaris.org/os/community/arc/caselog/1993/019>

    [6] Solaris Disk Label Conventions
        <http://sac.sfbay/PSARC/1994/072>
        <http://www.opensolaris.org/os/community/arc/caselog/1994/072>

    [7] <http://en.wikipedia.org/wiki/Master_boot_record>

    [8] <http://en.wikipedia.org/wiki/BSD_disklabel>


6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

