Block layer: integrity checking and lots of partitions

Storage devices do not always store and retrieve data with perfect fidelity. For this reason, there is steady interest in filesystems which use checksums on data stored to block devices. Rather than take the device's word that it successfully stored and retrieved a block, the filesystem can compare checksums and be sure. A certain amount of checksumming is also done by paranoid applications in user space. The checksums used by BitKeeper are said to have caught a number of corruption problems; successor tools like git have checksums wired deeply into their data structures. If a disk drive corrupts a git repository, users will know about it sooner rather than later.

Checksums are a useful tool, but they have one significant problem: checksum failures tend to turn up when it is too late to do anything about them. By the time a filesystem or application notices that a disk block isn't quite what it once was, the original data may be long gone and unrecoverable. But disk block corruption often happens in the process of getting the data to the disk; it would sure be nice if the disk itself could use a checksum to ensure that (1) the data got to the disk intact, and (2) the disk itself hasn't mangled it.

To that end, a few standards groups have put together schemes for the incorporation of data integrity checking into the hardware itself. These mechanisms generally take the form of an additional eight-byte checksum attached to each 512-byte block. The host system generates the checksum when it prepares a block for writing to the drive; that checksum will follow the data through the series of host controllers, RAID controllers, network fabrics, etc., with the hardware verifying the checksum along each step of the way. The checksum is stored with the data and, when the data is read in the future, it travels back with the data, once again being verified at each step. The end result should be that data corruption problems are caught immediately, and in a way which identifies which component of the system is at fault.

Needless to say, this integrity mechanism requires operating system support. As of the 2.6.27 kernel, Linux will have such support, at least for SCSI and SATA drives, thanks to Martin Petersen. The well-written documentation file included with the data integrity patches envisions three places where checksum generation and verification can be performed: in the block layer, in the filesystem, and in user space. Truly end-to-end protection seems to need user-space verification but, for now, the emphasis is on doing this work in the block layer or the filesystem - though, as of this writing, no integrity-aware filesystems exist in the mainline repository.

Drivers for block devices which can manage integrity data need to register some information with the block layer. This is done by filling in a blk_integrity structure and passing it to blk_integrity_register(). See the document for the full details; in short, this structure contains two function pointers: generate_fn() generates a checksum for a block of data, while verify_fn() verifies one. There are also functions for attaching a tag to a block - a feature supported by some drives. The data stored in the tag can be used by filesystem-level code to, for example, ensure that the block is really part of the file it is supposed to belong to.
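By way of illustration, a driver's registration might look roughly like the sketch below. It follows the interface as the documentation describes it; the callback signatures, the blk_integrity_exchg exchange structure, and everything carrying a "my_" prefix are assumptions made for this example rather than code taken from the patches themselves.

    #include <linux/blkdev.h>
    #include <linux/genhd.h>

    /* Generate protection data for each sector being written out. */
    static void my_generate_fn(struct blk_integrity_exchg *bix)
    {
            /* Checksum bix->data_buf, store the result in bix->prot_buf. */
    }

    /* Recompute and compare the checksum on read; non-zero means failure. */
    static int my_verify_fn(struct blk_integrity_exchg *bix)
    {
            return 0;  /* all is well in this skeleton */
    }

    static struct blk_integrity my_integrity = {
            .name        = "MY-EXAMPLE-DIF",
            .generate_fn = my_generate_fn,
            .verify_fn   = my_verify_fn,
            .tuple_size  = 8,  /* eight bytes of integrity data per sector */
    };

    /* Called by the driver once its gendisk has been set up. */
    static void my_register_integrity(struct gendisk *disk)
    {
            blk_integrity_register(disk, &my_integrity);
    }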
The block layer will, in the absence of an integrity-aware filesystem, prepare and verify checksum data itself. To that end, the bio structure has been extended with a new bi_integrity field, pointing to a bio_vec structure which describes the checksum information, along with some additional housekeeping. Happily, the integrity standards were written to allow the checksum information to be stored separately from the actual data; the alternative would have been to modify the entire Linux memory management system to accommodate that information. The bi_integrity area is where that information goes; scatter/gather DMA operations are used to transfer the checksum and the data to and from the drive together.

Integrity-aware filesystems, when they exist, will be able to take over the generation and verification of checksum data from the block layer. A call to bio_integrity_prep() will prepare a given bio structure for integrity verification; it is then up to the filesystem to generate the checksum (for writes) or check it (for reads). There is also a set of functions for managing the tag data; again, see the document for the details.
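As a hypothetical sketch of what a write path in such a filesystem might look like, consider the following; the helper name and the error handling are invented for the example, while bio_integrity_prep() and the standard bio calls are the interfaces named above and in the documentation.

    #include <linux/bio.h>
    #include <linux/fs.h>

    static int my_write_page(struct block_device *bdev, sector_t sector,
                             struct page *page)
    {
            struct bio *bio = bio_alloc(GFP_NOIO, 1);

            if (!bio)
                    return -ENOMEM;

            /* The target device, starting sector, and data pages must all
               be in place before the integrity preparation is done. */
            bio->bi_bdev = bdev;
            bio->bi_sector = sector;
            bio_add_page(bio, page, PAGE_SIZE, 0);
            /* A real caller would also set bio->bi_end_io and bi_private. */

            /* Prepare the bio for integrity I/O; generating (writes) or
               checking (reads) the checksums is then the filesystem's job. */
            if (bio_integrity_prep(bio)) {
                    bio_put(bio);
                    return -EIO;
            }

            submit_bio(WRITE, bio);
            return 0;
    }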
Extended partitions

One of the longer-lived annoyances in the Linux block layer has been the limit on the number of partitions which can be created on any one device. IDE devices can handle up to 64 partitions, which is usually enough, but SCSI devices can only manage 16 - including one reserved for the full device. As these devices get larger, and as applications which benefit from filesystem isolation (virtualization, for example) become more popular, this limit only becomes more irksome.

The interesting thing is that the work needed to circumvent this problem was done some years ago, when device numbers were extended to 32 bits. Some complicated schemes were proposed back in 2004 as a way of extending the number of partitions while not changing any existing device numbers, but that approach was never adopted. In the meantime, increasing use of tools like udev has pretty much eliminated the need for device number compatibility; on most distributions, there are no persistent device files anymore.

So when Tejun Heo revisited the partition limit problem, he didn't bother with obscure bit-shuffling schemes. Instead, with his patch set, block devices simply move to a new major device number and have all minor numbers dynamically assigned. That means that no block device has a stable (across boots) number; it also means that the minor numbers for partitions on the same device are not necessarily grouped together. But, since nobody ever really sees the device numbers on a contemporary distribution, none of this should matter.

Tejun's patch series is an interesting exercise in slowly evolving an interface toward a final goal, with a number of intermediate states. In the end, the API as seen by block drivers changes very little. There is a new flag (GENHD_FL_EXT_DEVT) which allows a disk to use extended partition numbers; once the minor numbers given to alloc_disk() are exhausted, any additional partitions will be numbered in the extended space. The intended use, though, would appear to be to allocate no traditional minor numbers at all - allocating disks with alloc_disk(0) - and to create all partitions in the extended space. Tejun's patch causes both the IDE and sd drivers to allocate gendisk structures in that way, moving all disks on most systems into the (shared) extended number space.
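A driver opting into this scheme would follow a pattern along these lines; apart from alloc_disk(), GENHD_FL_EXT_DEVT, and add_disk(), the names below are invented for illustration.

    #include <linux/genhd.h>

    static struct gendisk *my_create_disk(struct request_queue *q)
    {
            /* Ask for zero conventional minors; every partition will then
               be numbered in the shared, dynamically-assigned space. */
            struct gendisk *disk = alloc_disk(0);

            if (!disk)
                    return NULL;

            disk->flags |= GENHD_FL_EXT_DEVT;
            disk->queue = q;
            /* ... fill in fops, the disk name, and capacity as usual ... */
            add_disk(disk);  /* device numbers are assigned at this point */
            return disk;
    }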
Even though modern distributions are comfortable with dynamic device numbers (and names, for that matter), it seems hard to imagine that a change like this would be entirely free of systems-management problems across the full Linux user base. Distributors may still be a little nervous after the grief they took when the shift to the PATA drivers changed drive names on installed systems. So it's not really clear when Tejun's patches might make it into the mainline, or when distributors would make use of that functionality. The pressure for more partitions is unlikely to go away, though, so these patches may find their way in before too long.