http://people.redhat.com/msnitzer/docs/io-limits.txt

I/O Limits: block sizes, alignment and I/O hints

TOC:
====
* Overview
* Userspace access
* Standards
* Stacking I/O Limits
* Logical Volume Manager (LVM)
* Partition and Filesystem tools


Overview:
=========
The Linux I/O stack has been enhanced to consume vendor-provided
"I/O Limits" information that allows Linux tools (parted, lvm, mkfs.*,
etc.) to optimize placement of and access to data.  I/O that is not
properly aligned relative to the device's "I/O Limits" will result in
reduced performance or, in the worst case, application failure (see:
"Direct I/O best practices" in "Userspace access" below).

Not all storage devices export this "I/O Limits" information yet.  Such
"legacy" devices will still work because the defaults of the various
RHEL6 tools conservatively align all I/O on a 4K, or larger power of 2,
boundary.  Utilization of this "I/O Limits" information enables 4K
sector devices to be fully supported for data volumes.  Boot support for
4K sector devices is planned but not yet supported.  The kernel provides
both block device ioctl and sysfs access to each device's various "I/O
Limits".

I/O Limits
----------
Certain 4K sector devices may use a 4K 'physical_block_size' internally
but expose a finer-grained 512 byte 'logical_block_size' to Linux.  This
discrepancy introduces potential for misaligned I/O.  Linux will attempt
to start all data areas on a naturally aligned ('physical_block_size')
boundary by making sure it accounts for any 'alignment_offset' if the
beginning of the Linux block device is offset from the underlying
physical alignment.

Storage vendors can also supply "I/O hints" about a device's preferred
minimum unit for random I/O ('minimum_io_size') and streaming I/O
('optimal_io_size').  For example, these hints may correspond to a RAID
device's chunk size and stripe size respectively.
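
As a rough illustration of the alignment rule described above, the
following Python sketch checks whether a byte offset falls on a
naturally aligned boundary.  It assumes 'alignment_offset' is the byte
distance from the start of the block device to the first naturally
aligned boundary; the function names are hypothetical and not part of
any kernel or tool API.

```python
def is_naturally_aligned(offset, physical_block_size, alignment_offset=0):
    """True if byte `offset` sits on a physical block boundary.

    Assumption: `alignment_offset` is the byte offset of the first
    naturally aligned boundary, measured from the start of the device.
    """
    return (offset - alignment_offset) % physical_block_size == 0


def first_aligned_offset(start, physical_block_size, alignment_offset=0):
    """Smallest naturally aligned byte offset that is >= `start`."""
    misalignment = (start - alignment_offset) % physical_block_size
    if misalignment == 0:
        return start
    return start + (physical_block_size - misalignment)
```

For example, on a device with a 4K 'physical_block_size' and an
'alignment_offset' of 3584, a data area that wants to start at byte 0
would be moved to byte 3584 by first_aligned_offset().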


Userspace access
================
Direct I/O best practices
-------------------------
Users must always take care to use properly aligned and sized I/O.  This
is especially important for Direct I/O access.  Direct I/O should be
aligned on a 'logical_block_size' boundary and in multiples of the
'logical_block_size'.  With native 4K devices (logical_block_size is 4K)
it is now critical that applications perform Direct I/O that is a
multiple of the device's 'logical_block_size'.  This means that
applications that do not perform 4K aligned I/O, but 512-byte aligned
I/O, will break with native 4K devices.  Applications may consult a
device's "I/O Limits" to ensure they are using properly aligned and
sized I/O.  The "I/O Limits" are exposed through both sysfs and block
device ioctl interfaces (also see: libblkid).
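
The Direct I/O rule above can be sketched as a small validation helper.
This is only an illustration: check_direct_io() is a hypothetical name,
and the caller is assumed to have already obtained the device's
'logical_block_size' from one of the interfaces described in this
section.

```python
def check_direct_io(offset, length, logical_block_size):
    """Return a list of problems with a proposed Direct I/O request.

    An empty list means both the starting offset and the transfer
    length are multiples of the device's logical block size.
    """
    problems = []
    if offset % logical_block_size:
        problems.append(
            "offset %d is not a multiple of %d" % (offset, logical_block_size))
    if length % logical_block_size:
        problems.append(
            "length %d is not a multiple of %d" % (length, logical_block_size))
    return problems
```

A request at offset 512 for 512 bytes passes on a device with a 512 byte
'logical_block_size' but reports two problems on a native 4K device,
which is the failure mode described above.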

sysfs interface
---------------
/sys/block/<disk>/alignment_offset
/sys/block/<disk>/<partition>/alignment_offset
/sys/block/<disk>/queue/physical_block_size
/sys/block/<disk>/queue/logical_block_size
/sys/block/<disk>/queue/minimum_io_size
/sys/block/<disk>/queue/optimal_io_size

The kernel will still export these sysfs attributes for "legacy" devices
that do not provide "I/O Limits" information, for example:
alignment_offset:    0
physical_block_size: 512
logical_block_size:  512
minimum_io_size:     512
optimal_io_size:     0
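
A minimal sketch of reading the whole-disk attributes listed above from
Python follows.  read_io_limits() and its 'sysfs_root' parameter are
hypothetical names introduced for illustration; the parameter exists
only so the function can be pointed at a test directory.

```python
import os


def read_io_limits(disk, sysfs_root="/sys/block"):
    """Read the I/O Limits of a whole disk (e.g. "sda") from sysfs.

    Returns a dict mapping attribute name to its integer value in bytes.
    """
    base = os.path.join(sysfs_root, disk)
    paths = {
        "alignment_offset": os.path.join(base, "alignment_offset"),
        "physical_block_size":
            os.path.join(base, "queue", "physical_block_size"),
        "logical_block_size":
            os.path.join(base, "queue", "logical_block_size"),
        "minimum_io_size": os.path.join(base, "queue", "minimum_io_size"),
        "optimal_io_size": os.path.join(base, "queue", "optimal_io_size"),
    }
    limits = {}
    for name, path in paths.items():
        with open(path) as attr:
            limits[name] = int(attr.read().strip())
    return limits
```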

block device ioctls
-------------------
BLKALIGNOFF: alignment_offset
BLKPBSZGET: physical_block_size
BLKSSZGET: logical_block_size
BLKIOMIN: minimum_io_size
BLKIOOPT: optimal_io_size
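
The ioctls above can also be issued from Python through the standard
fcntl module.  The sketch below is an assumption-laden illustration: the
request numbers are derived from the _IO(0x12, n) definitions in
<linux/fs.h> as the author of this example understands them, each ioctl
is assumed to return a 4-byte integer, and query_io_limits() is a
hypothetical helper.  Opening a block device normally requires
appropriate privileges.

```python
import fcntl
import os
import struct


def _IO(ioc_type, nr):
    """Encode an argument-less ioctl request number (Linux _IO macro)."""
    return (ioc_type << 8) | nr


BLKSSZGET = _IO(0x12, 104)    # logical_block_size
BLKIOMIN = _IO(0x12, 120)     # minimum_io_size
BLKIOOPT = _IO(0x12, 121)     # optimal_io_size
BLKALIGNOFF = _IO(0x12, 122)  # alignment_offset
BLKPBSZGET = _IO(0x12, 123)   # physical_block_size

# Attribute name -> (request number, struct format of the result).
IOCTLS = {
    "alignment_offset": (BLKALIGNOFF, "i"),
    "physical_block_size": (BLKPBSZGET, "I"),
    "logical_block_size": (BLKSSZGET, "i"),
    "minimum_io_size": (BLKIOMIN, "I"),
    "optimal_io_size": (BLKIOOPT, "I"),
}


def query_io_limits(device_path):
    """Query the I/O Limits of a block device node such as /dev/sda."""
    fd = os.open(device_path, os.O_RDONLY)
    try:
        limits = {}
        for name, (request, fmt) in IOCTLS.items():
            # Passing a bytes buffer makes fcntl.ioctl return the
            # buffer as filled in by the kernel.
            raw = fcntl.ioctl(fd, request, struct.pack(fmt, 0))
            limits[name] = struct.unpack(fmt, raw)[0]
        return limits
    finally:
        os.close(fd)
```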


Standards
=========
ATA
---
ATA devices must report appropriate information via the IDENTIFY DEVICE
command.  ATA devices only report "I/O Limits" for 'physical_block_size',
'logical_block_size' and 'alignment_offset'.  The additional "I/O hints"
are outside the scope of the ATA Command Set.

SCSI
----
The kernel's "I/O Limits" support requires at least version 3 of the
SCSI Primary Commands protocol (SPC-3).  Linux will only send a READ
CAPACITY(16) and "extended inquiry" (which gains access to the BLOCK
LIMITS VPD page) to devices which claim conformance to SPC-3.

1) READ CAPACITY(16) provides the block sizes and alignment offset:
LOGICAL BLOCK LENGTH IN BYTES:
/sys/block/<disk>/queue/logical_block_size

LOGICAL BLOCKS PER PHYSICAL BLOCK EXPONENT is used to derive:
/sys/block/<disk>/queue/physical_block_size

LOWEST ALIGNED LOGICAL BLOCK ADDRESS:
/sys/block/<disk>/alignment_offset
/sys/block/<disk>/<partition>/alignment_offset
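
As a hedged illustration of how these fields could be decoded, the
sketch below parses a raw READ CAPACITY(16) parameter data buffer.  It
assumes the SBC-3 layout as the author of this example understands it:
LOGICAL BLOCK LENGTH IN BYTES in bytes 8-11 (big-endian) giving the
logical block size, the exponent in the low nibble of byte 13 scaling
that up to the physical block size, and LOWEST ALIGNED LOGICAL BLOCK
ADDRESS in the low 14 bits of bytes 14-15.  decode_read_capacity_16() is
a hypothetical name, not a kernel function.

```python
import struct


def decode_read_capacity_16(data):
    """Derive block sizes and alignment offset from READ CAPACITY(16).

    `data` is the raw parameter data (bytes) returned by the device.
    """
    logical = struct.unpack_from(">I", data, 8)[0]
    exponent = data[13] & 0x0F
    lowest_aligned_lba = struct.unpack_from(">H", data, 14)[0] & 0x3FFF
    return {
        "logical_block_size": logical,
        # 2**exponent logical blocks make up one physical block.
        "physical_block_size": logical << exponent,
        # The aligned LBA is converted from blocks to bytes.
        "alignment_offset": lowest_aligned_lba * logical,
    }
```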

2) BLOCK LIMITS VPD provides the "I/O hints":
OPTIMAL TRANSFER LENGTH GRANULARITY and OPTIMAL TRANSFER LENGTH are used
to derive:
/sys/block/<disk>/queue/minimum_io_size
/sys/block/<disk>/queue/optimal_io_size
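
A sketch of that derivation, under stated assumptions: both VPD fields
are taken to be expressed in logical blocks, with OPTIMAL TRANSFER
LENGTH GRANULARITY in bytes 6-7 and OPTIMAL TRANSFER LENGTH in bytes
12-15 of the page (big-endian), and each is multiplied by the logical
block size to obtain bytes.  decode_block_limits_hints() is a
hypothetical name.

```python
import struct


def decode_block_limits_hints(page, logical_block_size):
    """Derive the I/O hints, in bytes, from a BLOCK LIMITS VPD page.

    `page` is the raw VPD page 0xb0 as bytes, header included.
    """
    granularity = struct.unpack_from(">H", page, 6)[0]
    optimal_length = struct.unpack_from(">I", page, 12)[0]
    return {
        "minimum_io_size": granularity * logical_block_size,
        "optimal_io_size": optimal_length * logical_block_size,
    }
```

With 512 byte logical blocks, a granularity of 128 blocks and an optimal
transfer length of 512 blocks would yield a 'minimum_io_size' of 64K and
an 'optimal_io_size' of 256K.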

The sg3_utils package provides the 'sg_inq' utility that can be used to
access the BLOCK LIMITS VPD page (0xb0), using:
 sg_inq -p 0xb0 <device>


Stacking I/O Limits
===================
All layers of the Linux I/O stack have been engineered to propagate the
various "I/O Limits" up the stack.  When a layer consumes an attribute
or aggregates many devices, it must expose appropriate "I/O Limits" so
that upper-layer devices or tools will have an accurate view of the
storage as it is transformed.  Some practical examples are:
- only one layer in the I/O stack should adjust for a non-zero
  'alignment_offset'; once a layer adjusts for it, it will export a
  device with an 'alignment_offset' of zero
- a striped Device Mapper (DM) device, created with LVM, must export
  a 'minimum_io_size' and 'optimal_io_size' relative to the stripe
  count (number of disks) and user provided chunk size
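
The striped example can be sketched as follows.  This assumes the common
convention, matching the RAID example in the Overview, that
'minimum_io_size' is the chunk size and 'optimal_io_size' is the full
stripe (chunk size times stripe count); striped_io_hints() is a
hypothetical helper, not the DM implementation.

```python
def striped_io_hints(chunk_size, stripe_count):
    """I/O hints, in bytes, that a striped device might export."""
    return {
        "minimum_io_size": chunk_size,
        "optimal_io_size": chunk_size * stripe_count,
    }
```

For a 4-disk stripe with a 64K chunk size this gives a 'minimum_io_size'
of 64K and an 'optimal_io_size' of 256K.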

Linux Device Mapper (DM) and Software RAID (MD) device drivers can be
used to arbitrarily combine devices with different "I/O Limits".  The
kernel's block layer goes to great lengths to reasonably combine the
"I/O Limits" of the individual devices.  The kernel will not prevent
combining heterogeneous devices but the user should be aware of the risk
associated with doing so.

For instance, a 512 byte device and a 4K device may be combined into a
single logical DM device; the resulting DM device would have a
'logical_block_size' of 4K.  Filesystems layered on such a hybrid device
assume that 4K will be written atomically but in reality it will span 8
LBAs when issued to the 512 byte device.  Using a 4K 'logical_block_size'
for the higher-level DM device increases potential for a partial write
to the 512b device if there is a system crash.
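
The following sketch models this combination in a deliberately
simplified way.  It assumes block sizes and 'minimum_io_size' combine by
taking the largest value and 'optimal_io_size' by least common multiple
(with 0 meaning "undefined"); the real block layer logic handles more
cases.  stack_limits() and the 'partial_write_risk' flag are
hypothetical names used only for illustration.

```python
from math import gcd


def lcm_not_zero(a, b):
    """Least common multiple, treating 0 as 'undefined'."""
    if a == 0 or b == 0:
        return a or b
    return a * b // gcd(a, b)


def stack_limits(components):
    """Combine the I/O Limits (dicts of byte values) of several devices."""
    combined = {
        "logical_block_size": 512,
        "physical_block_size": 512,
        "minimum_io_size": 512,
        "optimal_io_size": 0,
    }
    for dev in components:
        for key in ("logical_block_size", "physical_block_size",
                    "minimum_io_size"):
            combined[key] = max(combined[key], dev[key])
        combined["optimal_io_size"] = lcm_not_zero(
            combined["optimal_io_size"], dev["optimal_io_size"])
    # A component with a smaller logical block size than the combined
    # device cannot write one combined-size block atomically.
    combined["partial_write_risk"] = any(
        dev["logical_block_size"] < combined["logical_block_size"]
        for dev in components)
    return combined
```

Combining a 512 byte device with a native 4K device in this model yields
a 'logical_block_size' of 4096 and sets the risk flag, mirroring the
hybrid case described above.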

If combining multiple devices' "I/O Limits" results in a conflict, the
block layer may report a warning that the device is susceptible to
partial writes and/or is misaligned.


Logical Volume Manager (LVM)
============================
LVM provides userspace tools that are used to manage the kernel's DM
devices.  LVM shifts the start of the data area used by a given DM
device to account for a non-zero 'alignment_offset' associated
with any device LVM manages.  This means LVM logical volumes will be
properly aligned (alignment_offset=0).  LVM will adjust for any
'alignment_offset' by default but this may be disabled through
lvm.conf's 'data_alignment_offset_detection'.  Disabling this is not
recommended.

LVM will also detect the "I/O hints" for a device.  The start of a
device's data area will be a multiple of the 'minimum_io_size' or
'optimal_io_size' exposed in sysfs.  'minimum_io_size' is used if
'optimal_io_size' is undefined (0).  LVM will automatically determine
these "I/O hints" by default but this may be disabled through lvm.conf's
'data_alignment_detection'.  Disabling this is not recommended.
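
The selection rule above, combined with the 'alignment_offset' shift
described earlier in this section, might be sketched like this.  It is
not LVM's code: the function names are hypothetical, and adding the
offset after rounding up is an assumption about how the two adjustments
combine.

```python
def choose_data_alignment(minimum_io_size, optimal_io_size):
    """Use 'optimal_io_size' unless it is undefined (0)."""
    return optimal_io_size if optimal_io_size != 0 else minimum_io_size


def align_data_start(earliest_start, alignment, alignment_offset=0):
    """Round the data area start up to `alignment`, then shift it by
    `alignment_offset`.  All values are in bytes."""
    rounded = ((earliest_start + alignment - 1) // alignment) * alignment
    return rounded + alignment_offset
```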


Partition and Filesystem tools
==============================
util-linux-ng's libblkid and fdisk
----------------------------------
The libblkid library provided with the util-linux-ng package includes a
programmatic API to access a device's "I/O Limits".  libblkid allows
applications, especially those that use Direct I/O, to properly size
their I/O requests.  util-linux-ng's fdisk uses libblkid to determine a
device's "I/O Limits" for optimal placement of all partitions.  If a
device doesn't provide "I/O Limits" information fdisk will align all
partitions on a 1MB boundary.
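
For illustration, libblkid's topology functions can be reached from
Python through ctypes.  The sketch below assumes a libblkid that
provides blkid_new_probe_from_filename(), blkid_probe_get_topology(),
blkid_free_probe() and the blkid_topology_get_*() getters, each getter
returning an unsigned long; probe_topology() itself is a hypothetical
wrapper and error handling is kept minimal.

```python
import ctypes
import ctypes.util

# Result key -> name of the libblkid getter assumed to provide it.
GETTERS = {
    "alignment_offset": "blkid_topology_get_alignment_offset",
    "minimum_io_size": "blkid_topology_get_minimum_io_size",
    "optimal_io_size": "blkid_topology_get_optimal_io_size",
    "logical_block_size": "blkid_topology_get_logical_sector_size",
    "physical_block_size": "blkid_topology_get_physical_sector_size",
}


def probe_topology(device_path):
    """Return the I/O Limits libblkid reports for a device node."""
    libname = ctypes.util.find_library("blkid")
    if libname is None:
        raise OSError("libblkid not found")
    lib = ctypes.CDLL(libname)
    lib.blkid_new_probe_from_filename.restype = ctypes.c_void_p
    lib.blkid_new_probe_from_filename.argtypes = [ctypes.c_char_p]
    lib.blkid_probe_get_topology.restype = ctypes.c_void_p
    lib.blkid_probe_get_topology.argtypes = [ctypes.c_void_p]
    lib.blkid_free_probe.restype = None
    lib.blkid_free_probe.argtypes = [ctypes.c_void_p]

    probe = lib.blkid_new_probe_from_filename(device_path.encode())
    if not probe:
        raise OSError("cannot probe %s" % device_path)
    try:
        topology = lib.blkid_probe_get_topology(probe)
        if not topology:
            raise OSError("no topology for %s" % device_path)
        limits = {}
        for key, symbol in GETTERS.items():
            getter = getattr(lib, symbol)
            getter.restype = ctypes.c_ulong
            getter.argtypes = [ctypes.c_void_p]
            limits[key] = getter(topology)
        return limits
    finally:
        # The topology object belongs to the probe and is freed with it.
        lib.blkid_free_probe(probe)
```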

parted and libparted
--------------------
parted's libparted also uses libblkid's "I/O Limits" API.  The RHEL6
installer (anaconda) uses libparted.  This means that all partitions
created with either the installer or parted will be properly aligned.
The default alignment for all partitions created on a device that
doesn't appear to provide "I/O Limits" information will be 1MB.

The heuristic parted uses is:
1)  Always use the reported 'alignment_offset' as the offset for the
    start of the first primary partition.
2a) If 'optimal_io_size' is defined (not 0) align all partitions on an
    'optimal_io_size' boundary.
2b) If 'optimal_io_size' is undefined (0) and 'alignment_offset' is 0
    and 'minimum_io_size' is a power of 2: use a 1MB default alignment.
    - as you can see this is the catch all for "legacy" devices which
      don't appear to provide "I/O hints"; so in the default case all
      partitions will align on a 1MB boundary.
    - NOTE: we can't distinguish between a "legacy" device and modern
      device that provides "I/O hints" with alignment_offset=0 and
      optimal_io_size=0.  Such a device might be a single SAS 4K device.
      So worst case we lose < 1MB of space at the start of the disk.
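
The heuristic above can be sketched as a small function.  This is an
illustration of the listed steps only, not parted's source:
parted_alignment() is a hypothetical name, and any combination of values
that the steps above do not cover is reported as None rather than
guessed.

```python
MIB = 1024 * 1024


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def parted_alignment(alignment_offset, minimum_io_size, optimal_io_size):
    """Return (first_partition_offset, alignment) in bytes.

    `alignment` is None when the heuristic as described in the text
    does not say what to use.
    """
    # Step 1: the first primary partition starts at 'alignment_offset'.
    first_partition_offset = alignment_offset
    if optimal_io_size != 0:
        # Step 2a: align on an 'optimal_io_size' boundary.
        alignment = optimal_io_size
    elif alignment_offset == 0 and is_power_of_two(minimum_io_size):
        # Step 2b: catch-all for "legacy" devices, 1MB default.
        alignment = MIB
    else:
        alignment = None
    return first_partition_offset, alignment
```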

Filesystem tools
----------------
mkfs.ext[234], mkfs.xfs, and mkfs.gfs2 have been enhanced to consume a
device's "I/O Limits".  Linux filesystems are not allowed to be
formatted to use a block size that is smaller than the underlying
storage's 'logical_block_size'.  mkfs.ext[234] and mkfs.xfs also use the
"I/O hints" to lay out on-disk data structures and data areas relative to
the underlying storage's 'minimum_io_size' and 'optimal_io_size' -- this
allows filesystems to be optimally formatted for various RAID (striped)
layouts.
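
A hedged sketch of both rules follows.  The block size check mirrors the
restriction stated above.  The geometry helper assumes the hints are
simply expressed in filesystem blocks; the 'stride' and 'stripe_width'
names are borrowed from the mke2fs extended options, and the exact
calculation each mkfs tool performs may differ.  Both function names are
hypothetical.

```python
def fs_block_size_allowed(fs_block_size, logical_block_size):
    """A filesystem block may not be smaller than a logical block."""
    return fs_block_size >= logical_block_size


def stripe_geometry(fs_block_size, minimum_io_size, optimal_io_size):
    """Express the I/O hints (bytes) in filesystem blocks."""
    return {
        "stride": minimum_io_size // fs_block_size,
        "stripe_width": optimal_io_size // fs_block_size,
    }
```

With a 4K filesystem block size on a device reporting a 64K
'minimum_io_size' and a 256K 'optimal_io_size', this gives a stride of
16 blocks and a stripe width of 64 blocks.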
