I am sponsoring the following fasttrack for Shidokht Yadegari. It
introduces a new 'extended' VTOC API for storage in the 1-2TB capacity
range. The new API supports Solaris disk labeling in a way that is
compatible with current OBP boot code on sparc and is compatible with
current multi-OS MBR fdisk practice on x86. The proposed release
binding is micro/patch. Additional referenced material is in the case
directory.
-Chris
Template Version: @(#)sac_nextcase 1.66 04/17/08 SMI
This information is Copyright 2008 Sun Microsystems
1. Introduction
1.1. Project/Component Working Name:
Extended VTOC
1.2. Name of Document Author/Supplier:
Author: Shidokht Yadegari
1.3 Date of This Document:
20 May, 2008
4. Technical Description
4.1 Introduction
While a 64-bit Solaris kernel currently supports EFI (Extensible
Firmware Interface - Intel) GPT (GUID Partition Table) labeled
disks larger than 1TB [1, 7], current boot code does not support
GPT. The current Sun disk label limits the size of a bootable disk
to less than 1TB.
A Sun disk label separates private on-disk data from the public
API. To support booting from disks in the 1TB-2TB size range while
the long term GPT boot solution is being resolved, this case
proposes taking advantage of the private/public separation. The
case proposes a minimal compatible modification to the private
on-disk 'struct dk_label' from SunOS_4.x [8] and a compatible
extension to the public VTOC API.
Even if GPT boot support were available for Solaris today, Solaris
would still need to support the proposed extended VTOC on x86
machines, since Windows and Linux depend on non-GPT MBR fdisk for
disks smaller than 2TB (a single disk with separate fdisk partitions
that boot different operating systems).
There is no current support for disks over 1TB on 32-bit kernels
[1]. This project does not propose any change in this respect.
This project requests patch release binding.
4.2 Solution Overview
The current 1TB 'dk_label' and VTOC limits originate from using
signed 32-bit values for the start and size of a slice.
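As a rough illustration of where the 1TB and 2TB figures come from,
the sketch below computes the largest capacity that a signed and an
unsigned 32-bit block count can describe. It assumes 512-byte sectors,
which this case does not state explicitly; the function names are
invented for the example.

```c
#include <stdint.h>

#define SECTOR_SIZE 512ULL   /* assumption: 512-byte sectors */

/* Largest capacity (bytes) when the block count is a signed 32-bit value. */
uint64_t max_bytes_signed32(void)
{
    return (uint64_t)INT32_MAX * SECTOR_SIZE;
}

/* Largest capacity (bytes) when the same 32 bits are read as unsigned. */
uint64_t max_bytes_unsigned32(void)
{
    return (uint64_t)UINT32_MAX * SECTOR_SIZE;
}
```

The two results are 512 bytes short of 2^40 and 2^41 bytes
respectively, which is the 1TB limit today and the 2TB limit once the
fields are treated as unsigned.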
This case changes the interpretation of the private on-disk values,
and controls exposure of these changes at the public API level to
quickly provide support for boot devices in the 1-2TB size range.
o For the on-disk 'dk_label', block address signed fields are
treated as unsigned. To prevent breakage of sparc OBP boot code,
which already treats block addresses as unsigned, no revision of
the on-disk format is proposed.
o For the VTOC API, the Solaris kernel code detects a 1TB-2TB disk
based on disk capacity and enforces correct API access. This
proposal covers
* A new 'extvtoc' structure, where the extended VTOC is identical
to the current VTOC with the following exceptions:
o Extended VTOC uses the size-invariant 64-bit 'diskaddr_t' to
represent disk block addresses.
o Extended VTOC fields are defined using size-invariant types
in order to reduce the 32/64 ioctl data-model mapping.
o NOTE: While the new 'extvtoc' API can support sizes much
larger than 2TB, on-disk dk_label limitations limit capacity
to 2TB. A device that is larger than 2TB can be supported in
a limited fashion by reducing the dk_label represented size
to 2TB.
* New ioctls for the 'extvtoc'. The disk capacity establishes the
VTOC mode of a disk: current (<1TB) or extended (>1TB).
Extended ioctls work in both modes.
* Behavior of current VTOC API for >1TB disks.
The associated changes will affect
* Disk target driver
* Interfaces: ioctls/libadm
* Utilities/Install
as described below.
4.2.1 Disk Target Drivers
The project proposes the following changes to target drivers
(mostly sd/ssd/cmdk drivers):
* Move the cutoff limit for VTOC labeling to 2TB.
* Support having a limited-to-2TB VTOC (and legacy MBR on x86) on
disks larger than 2TB.
* Allow the x86 only whole disk node /dev/rdsk/cXtYdZp0 to be
always accessible regardless of disk content.
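A minimal sketch of the capacity-based decisions described above,
assuming capacities are expressed in blocks and that the boundaries
fall exactly at the signed and unsigned 32-bit limits. The enum and
function names are hypothetical and do not come from the sd/ssd/cmdk
sources.

```c
#include <stdint.h>

/* Hypothetical names; none of these identifiers come from the case. */
enum vtoc_mode { VTOC_MODE_CURRENT, VTOC_MODE_EXTENDED };

/* Block counts representable in a signed / unsigned 32-bit label field. */
#define LEGACY_MAX_BLOCKS   ((uint64_t)INT32_MAX)
#define EXTENDED_MAX_BLOCKS ((uint64_t)UINT32_MAX)

/* Pick the VTOC mode from the disk capacity in blocks. */
enum vtoc_mode vtoc_mode_for(uint64_t nblocks)
{
    return (nblocks <= LEGACY_MAX_BLOCKS) ?
        VTOC_MODE_CURRENT : VTOC_MODE_EXTENDED;
}

/* Capacity the label may describe: the device size, clamped to the
 * largest value the 32-bit on-disk field can hold. */
uint64_t label_capacity(uint64_t nblocks)
{
    return (nblocks > EXTENDED_MAX_BLOCKS) ? EXTENDED_MAX_BLOCKS : nblocks;
}
```

With these assumptions, a device larger than 2TB still gets a label,
but the label describes only the first 2TB of the device.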
4.2.2 Interfaces: ioctls/libadm
4.2.2.1 Data Structures and ioctl Behavior:
4.2.2.1.1 Current Limitations
For the current vtoc, the following fields are associated with size
limitations:
struct vtoc {
    struct partition v_part[] {
        daddr_t p_start;
        long p_size;
    }
}
struct part_info {
    daddr_t p_start;
    int p_length;
}
struct dk_allmap {
    struct dk_map dka_map[] {
        daddr_t dkl_nblk;
    }
}
Since 'daddr_t' is a long, we have a 1TB limit for a 32-bit
application. Also, 'p_length' implies 1TB limit even in 64-bit
applications.
DKIOCSVTOC       struct vtoc
DKIOCGVTOC       struct vtoc
DKIOCPARTINFO    struct part_info
read_vtoc        struct vtoc
write_vtoc       struct vtoc
DKIOCGAPART      struct dk_allmap
DKIOCSAPART      struct dk_allmap
The above ioctls/library functions (plus the geometry related
ioctls) do not allow partitioning a disk > 1TB with current VTOC
API.
4.2.2.1.2 Proposed Solution
Introduce new ioctls, library functions and data structures (except
for DKIOCGAPART and DKIOCSAPART, explanation further down).
DKIOCSEXTVTOC       struct extvtoc
DKIOCGEXTVTOC       struct extvtoc
DKIOCEXTPARTINFO    struct extpart_info (x86 only)
read_extvtoc()      struct extvtoc
write_extvtoc()     struct extvtoc
Among possible alternative solutions, redefining struct vtoc with
unsigned values for start/size was considered, but was rejected in
order to maintain source compatibility with any third-party
applications consuming current interfaces.
4.2.2.1.2.1 Proposed Solution: Header file changes: vtoc.h
Add the following new structures, defines, and interfaces - using
32/64-bit size-invariant data types and 'diskaddr_t' for disk
addresses: (NOTE: 'ext' comments come from current VTOC API
wording)
struct extpartition {
    ushort_t p_tag;      /* ID tag of partition */
    ushort_t p_flag;     /* permission flags */
    ushort_t p_pad[2];
    diskaddr_t p_start;  /* start sector no of partition */
    diskaddr_t p_size;   /* # of blocks in partition */
};
struct extvtoc {
    uint64_t v_bootinfo[3];  /* info needed by mboot (unsupported) */
    uint64_t v_sanity;       /* to verify vtoc sanity */
    uint64_t v_version;      /* layout version */
    char v_volume[LEN_DKL_VVOL];  /* volume name */
    ushort_t v_sectorsz;     /* sector size in bytes */
    ushort_t v_nparts;       /* number of partitions */
    ushort_t pad[2];
    uint64_t v_reserved[10];
    struct extpartition v_part[V_NUMPAR];  /* partition headers */
    uint64_t timestamp[V_NUMPAR];  /* partition timestamp (unsupported) */
    char v_asciilabel[LEN_DKL_ASCII];  /* for compatibility */
};
#define V_EXTVERSION V_VERSION /* extvtoc layout version number */
#define VT_EOVERFLOW (-7) /* VTOC op. data struct limited */
extern int read_extvtoc(int, struct extvtoc *);
extern int write_extvtoc(int, struct extvtoc *);
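To illustrate why the legacy structure cannot carry the values that
'struct extpartition' can, here is a sketch of a narrowing conversion
with an overflow check. The 'sk_' types are local stand-ins defined
only so the example compiles outside Solaris; they model 'diskaddr_t'
as a 64-bit unsigned integer and the legacy partition as seen by a
32-bit application. The helper name is hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Local stand-ins; the real definitions live in <sys/vtoc.h>. */
typedef uint64_t sk_diskaddr_t;

struct sk_extpartition {
    uint16_t p_tag;
    uint16_t p_flag;
    uint16_t p_pad[2];
    sk_diskaddr_t p_start;
    sk_diskaddr_t p_size;
};

struct sk_partition32 {      /* legacy layout in a 32-bit application */
    uint16_t p_tag;
    uint16_t p_flag;
    int32_t p_start;
    int32_t p_size;
};

/* Narrow an extended partition entry to the legacy form.
 * Returns 0 on success, EOVERFLOW if start or size does not fit
 * in a signed 32-bit field. */
int ext_to_legacy(const struct sk_extpartition *in,
    struct sk_partition32 *out)
{
    if (in->p_start > (uint64_t)INT32_MAX ||
        in->p_size > (uint64_t)INT32_MAX)
        return EOVERFLOW;
    out->p_tag = in->p_tag;
    out->p_flag = in->p_flag;
    out->p_start = (int32_t)in->p_start;
    out->p_size = (int32_t)in->p_size;
    return 0;
}
```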
4.2.2.1.2.2 Proposed Solution: Header file changes: dkio.h
The following new ioctls and data structures are defined:
#define DKIOCGEXTVTOC       (DKIOC|23)
#define DKIOCSEXTVTOC       (DKIOC|24)
#define DKIOCEXTPARTINFO    (DKIOC|19)
struct extpart_info {
    diskaddr_t p_start;
    diskaddr_t p_length;
};
Proposed updates to the dkio(7I) and read_vtoc(3EXT) man pages are in
the materials directory.
4.2.2.1.2.3 Proposed Solution: Code Change Overview
Main changes when operating on >1TB disks:
* DKIOCSVTOC and DKIOCGVTOC (unless disk is GPT labeled) will
return EOVERFLOW instead of ENOTSUP. For read_vtoc and write_vtoc
a new return value, VT_EOVERFLOW, is defined, to be returned when
DKIOCGVTOC/DKIOCSVTOC return EOVERFLOW. This is mainly a
notification mechanism telling applications to use the new
interfaces instead.
* DKIOCGGEOM/DKIOCSGEOM and DKIOCG_PHYGEOM will be supported.
* DKIOCPARTINFO will return EOVERFLOW [3]. Applications should use
DKIOCEXTPARTINFO instead.
Please see ioctl_behavior_matrix.txt in the materials directory for
specific details on ioctl behavior.
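The notification mechanism suggests a fallback pattern for
applications: try the current interface first and retry with the
extended one on overflow. The sketch below uses stand-in functions in
place of read_vtoc() and read_extvtoc() so that it compiles anywhere;
the 'sk_' names and the simplified return values (0 on success) are
assumptions made for illustration only.

```c
#include <stdint.h>

#define SK_VT_EOVERFLOW (-7)   /* mirrors VT_EOVERFLOW from vtoc.h above */

struct sk_label { uint64_t nblocks; };   /* hypothetical label summary */

/* Stand-in for read_vtoc(): overflows beyond the signed 32-bit limit. */
int sk_read_legacy(uint64_t disk_blocks, struct sk_label *out)
{
    if (disk_blocks > (uint64_t)INT32_MAX)
        return SK_VT_EOVERFLOW;
    out->nblocks = disk_blocks;
    return 0;
}

/* Stand-in for read_extvtoc(): works in both VTOC modes. */
int sk_read_extended(uint64_t disk_blocks, struct sk_label *out)
{
    out->nblocks = disk_blocks;
    return 0;
}

/* Try the current interface; on overflow, retry with the extended one.
 * *used_ext reports which path produced the result. */
int sk_read_label(uint64_t disk_blocks, struct sk_label *out, int *used_ext)
{
    int rc = sk_read_legacy(disk_blocks, out);

    *used_ext = 0;
    if (rc == SK_VT_EOVERFLOW) {
        *used_ext = 1;
        rc = sk_read_extended(disk_blocks, out);
    }
    return rc;
}
```

Since the extended interfaces work in both modes, a new application
could equally call the extended variant unconditionally; the fallback
form is mainly of interest to code that must also run where the new
interfaces are absent.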
Rationale for not extending DKIOC(G/S)APART:
* DKIOCSAPART was EOLed by PSARC/2001/570 [1], although EOL was not
announced.
* A combination of other ioctls (DKIOCGGEOM, DKIOCGVTOC/DKIOCSVTOC)
can be used to achieve the same end results with some additional
calculations.
* These ioctls are very rarely used. Install does not utilize
them. In Nevada ON gate, there are 3 consumers, 2 of which are
for floppies; the third one is format (which we will change with
this project).
* DKIO ioctl numbers are precious, and we don't want to take more
if we don't have to.
4.2.3 Utilities/Install
4.2.3.1 format(1M):
format (-e) will allow labeling a disk with VTOC regardless of the
size of the disk. Also, since the cutoff for supporting VTOC in
target drivers moved to 2TB, the default label type for disks less
than 2TB will be VTOC.
4.2.3.2 fmthard(1M):
Since the cutoff for supporting VTOC in target drivers moved to
2TB, the default label type for disks less than 2TB will be VTOC.
format should be used if the user wants to force a VTOC label onto
a disk > 2TB that does not already have a VTOC label (or legacy MBR
on x86).
4.2.3.3 fdisk(1M):
Currently fdisk fails on a disk > 1TB with no GPT label. fdisk
also treats MBR partition information as signed 32-bit values
instead of unsigned, which limits it to 1TB.
The project proposes to allow fdisk to run on x86 regardless of
the contents of the disk. fdisk and consumers of fdisk will also be
modified to use unsigned 32-bit values so we can handle up to 2TB.
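The effect of the signed interpretation can be sketched as follows.
An MBR partition entry stores its sector count as a 4-byte
little-endian field; the helpers below decode such a field both ways.
The function names are invented for this example, 512-byte sectors are
assumed, and the signed conversion relies on the usual two's
complement behavior of common compilers.

```c
#include <stdint.h>

/* Decode a little-endian 32-bit field as stored in an MBR entry. */
uint32_t le32_decode(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
        ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Partition size in bytes with the sector count read as unsigned. */
uint64_t size_bytes_unsigned(const unsigned char numsect[4])
{
    return (uint64_t)le32_decode(numsect) * 512u;
}

/* The same field read as signed: counts above INT32_MAX go negative
 * (implementation-defined conversion, two's complement in practice). */
int64_t size_bytes_signed(const unsigned char numsect[4])
{
    return (int64_t)(int32_t)le32_decode(numsect) * 512;
}
```

For a partition of 0xC0000000 sectors (1.5 x 2^40 bytes), the unsigned
reading yields the correct size while the signed reading yields a
negative value.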
4.2.3.4 SVM/metainit(1M):
SVM will be extended to create metadevices comprised of physical
disks with VTOC labels up to 2 TB. Metadevices fabricate label
information so that they can handle disk ioctl requests. SVM will
be changed to handle the DKIOCGEXTVTOC and DKIOCSEXTVTOC ioctls for
metadevices up to 2 TB.
Current behavior of the ioctls with metadevices:
<= 1 TB
    DKIOCSVTOC - will always succeed
    DKIOCGVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed unless VTOC labeled (ENOTSUP)
> 1 TB
    DKIOCSVTOC - will always fail (ENOTSUP)
    DKIOCGVTOC - will always fail (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed
Proposed behavior:
<= 1 TB
    DKIOCSVTOC - will always succeed
    DKIOCGVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSEXTVTOC - will always succeed
    DKIOCGEXTVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed unless VTOC labeled (ENOTSUP)
> 1 TB and < 2 TB
    DKIOCSVTOC - will always fail (EOVERFLOW)
    DKIOCGVTOC - will always fail (EOVERFLOW) unless EFI/GPT labeled (ENOTSUP)
    DKIOCSEXTVTOC - will always succeed
    DKIOCGEXTVTOC - will always succeed unless EFI/GPT labeled (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed unless VTOC labeled (ENOTSUP)
> 2 TB
    DKIOCSVTOC - will always fail (EOVERFLOW)
    DKIOCGVTOC - will always fail (EOVERFLOW) unless EFI/GPT labeled (ENOTSUP)
    DKIOCSEXTVTOC - will always fail (ENOTSUP)
    DKIOCGEXTVTOC - will always fail (ENOTSUP)
    DKIOCSETEFI - will always succeed
    DKIOCGETEFI - will always succeed
There are no backward compatibility issues.
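The proposed metadevice behavior can be restated as a small lookup
function, which may help when checking an implementation against the
lists above. This is only a restatement under two assumptions: a
metadevice is either VTOC or EFI/GPT labeled, and the size class is
supplied by the caller rather than derived from a byte count. All
'sk_' identifiers are hypothetical.

```c
#include <errno.h>

enum sk_size  { SK_LE_1TB, SK_1TB_TO_2TB, SK_GT_2TB };
enum sk_label { SK_LABEL_VTOC, SK_LABEL_EFI };
enum sk_ioctl {
    SK_SVTOC, SK_GVTOC, SK_SEXTVTOC, SK_GEXTVTOC, SK_SETEFI, SK_GETEFI
};

/* Returns 0 where the proposal says "will always succeed", otherwise
 * the errno value listed for that ioctl, size class and label. */
int sk_expected_errno(enum sk_ioctl op, enum sk_size size,
    enum sk_label label)
{
    switch (op) {
    case SK_SVTOC:
        return (size == SK_LE_1TB) ? 0 : EOVERFLOW;
    case SK_GVTOC:
        if (label == SK_LABEL_EFI)
            return ENOTSUP;
        return (size == SK_LE_1TB) ? 0 : EOVERFLOW;
    case SK_SEXTVTOC:
        return (size == SK_GT_2TB) ? ENOTSUP : 0;
    case SK_GEXTVTOC:
        if (size == SK_GT_2TB || label == SK_LABEL_EFI)
            return ENOTSUP;
        return 0;
    case SK_SETEFI:
        return 0;
    case SK_GETEFI:
        if (size != SK_GT_2TB && label == SK_LABEL_VTOC)
            return ENOTSUP;
        return 0;
    }
    return EINVAL;   /* unknown ioctl selector */
}
```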
4.2.3.5 Install:
Since the cutoff for supporting VTOC in target drivers moved to
2TB, install will also use VTOC labels for disks up to 2TB.
GPT label support for install is not added by this phase of the
project; it will be addressed in a future phase (when we are
planning to move to GPT labels by default for boot on disks >
2TB).
4.3 Private Interfaces
4.3.1 dklabel.h, vtoc.h, altsctr.h, dadkio.h
The on-disk VTOC label (dklabel.h) has the same issue: some
structure fields are defined as signed 32-bit values (which is the
root of all 1TB VTOC limitations). In addition, the structures
defined for DIOCTL_RWCMD and alternate sector handling have the
same issue.
We will use larger data types for the problematic fields and enable
them with a new preprocessor symbol test _EXTVTOC. This is done to
keep source compatibility with drivers that are already using these
structures and will not be modified for > 1 or 2 TB support (such
as pcmcia disk support).
The kernel macros vtoc32tovtoc and vtoctovtoc32 will be updated to
take the larger data types into account.
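The preprocessor gating described above might look roughly like the
following. This is a sketch only: the 'sk_' type names are invented,
and the choice of an unsigned 32-bit type for the wider variant is an
assumption based on the statement that the on-disk format is not
revised. Only the symbol _EXTVTOC and the 'dk_map' block-count field
come from this case.

```c
#include <stdint.h>

/* A consumer that wants the wider fields defines the symbol before
 * including the header; unmodified drivers leave it undefined. */
#define _EXTVTOC

#ifdef _EXTVTOC
typedef uint32_t sk_blkaddr_t;   /* assumed: unsigned, reaches 2TB */
#else
typedef int32_t sk_blkaddr_t;    /* legacy: signed, stops at 1TB */
#endif

struct sk_dk_map {
    sk_blkaddr_t dkl_cylno;   /* starting cylinder */
    sk_blkaddr_t dkl_nblk;    /* number of blocks */
};

/* Returns 1 if nblocks survives a round trip through the label field. */
int sk_map_can_hold(uint64_t nblocks)
{
    struct sk_dk_map m;

    m.dkl_cylno = 0;
    m.dkl_nblk = (sk_blkaddr_t)nblocks;
    return (uint64_t)m.dkl_nblk == nblocks;
}
```

Because both variants are 32 bits wide, the structure size is the same
either way, which is consistent with leaving the on-disk layout
untouched.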
4.4 Interface classification
Interface exported     Level       Comments
struct extvtoc         Evolving
struct extpart_info    Evolving    x86 only
DKIOCSEXTVTOC          Evolving
DKIOCGEXTVTOC          Evolving
DKIOCEXTPARTINFO       Evolving    x86 only
read_extvtoc()         Evolving
write_extvtoc()        Evolving
4.5 Backwards Compatibility
A disk > 1TB with a VTOC label extending beyond 1TB is not supported
on prior releases; this is to be documented in a release note.
What happens if a >1TB disk with a VTOC label extending beyond 1TB
is used in an older release? Here is what we have found
experimentally on a SCSI disk and by code inspection:
o X86: Current Nevada and S10U5 (both 64-bit):
o The VTOC label is not recognized in target driver.
o Normal open (with no non-block flag) of all disk minor nodes
will fail.
o Format will assume there is no VTOC on the disk and uses
EFI/GPT label to label the device.
o SPARC: Current Nevada:
o Warning message from target driver is issued.
o Target driver recognizes and accepts the VTOC.
o Normal open of disk minor nodes succeeds similar to <1TB VTOC.
o Mount and reading an existing ufs file system of > 1TB
succeed.
o Format gets confused (e.g., a disk of about 2TB is seen as a
4.87 GB drive, auto-configure without the -e option fails...).
o S10U5:
o Warning message from target driver is issued.
o Target driver assumed the label is invalid.
o Normal open of disk minor nodes fails.
o Format -e does not recognize any label on the disk and gets
confused (e.g., auto-configuration fails).
References:
[1] PSARC 2001/570 Multi-terabyte disk support
<http://sac.sfbay/PSARC/2001/570>
<http://www.opensolaris.org/os/community/arc/caselog/2001/570>
[2] VTOC changes
<http://sac.sfbay/PSARC/1991/062>
<http://www.opensolaris.org/os/community/arc/caselog/1991/062>
[3] 6691817 "DKIOCPARTINFO can overflow ..."
<http://monaco.sfbay.sun.com/detail.jsf?cr=6691817>
<http://bugs.opensolaris.org/view_bug.do?bug_id=6691817>
[4] x86 Disk Layout
<http://sac.sfbay/PSARC/1993/015>
<http://www.opensolaris.org/os/community/arc/caselog/1993/015>
[5] Disk preparation, x86-specific
<http://sac.sfbay/PSARC/1993/019>
<http://www.opensolaris.org/os/community/arc/caselog/1993/019>
[6] Solaris Disk Label Conventions
<http://sac.sfbay/PSARC/1994/072>
<http://www.opensolaris.org/os/community/arc/caselog/1994/072>
[7] <http://en.wikipedia.org/wiki/Master_boot_record>
[8] <http://en.wikipedia.org/wiki/BSD_disklabel>
6. Resources and Schedule
6.4. Steering Committee requested information
6.4.1. Consolidation C-team Name:
ON
6.5. ARC review type: FastTrack
6.6. ARC Exposure: open