Hi Jan,
On 2017/3/13 23:14, Jan Kara wrote:
> blkdev_open() may race with gendisk shutdown in two different ways.
> Either del_gendisk() has already unhashed block device inode (and thus
> bd_acquire() will end up creating new block device inode) however
> gen_gendisk() will still return the gendisk that is being destroyed.
> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
> before we get to get_gendisk() and get_gendisk() will return new gendisk
> that got allocated for device that reused the device number.
>
> In both cases this will result in possible inconsistencies between
> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
> destroyed and device number reused, in the second case immediately).
> Fix the problem by checking whether the gendisk is still alive and inode
> hashed when associating bdev inode with it and its bdi. That way we are
> sure that we will not associate bdev inode with disk that got past
> blk_unregister_region() in del_gendisk() (and thus device number can get
> reused). Similarly, we will not associate bdev that was once associated
> with gendisk that is going away (and thus the corresponding bdev inode
> will get unhashed in del_gendisk()) with a new gendisk that is just
> reusing the device number.
I have run some extended tests based on Omar Sandoval's script [1] these days,
and found two new problems related with the race between bdev open and gendisk
shutdown.
The first problem lies in __blkdev_get(), and will cause use-after-free, or
worse oops. It happens because there may be two instances of gendisk related
with one bdev. When the process which owns the newer gendisk completes the
invocation of __blkdev_get() firstly, the other process which owns the older
gendisk will put the last reference of the older gendisk and cause
use-after-free.
The following call sequences illustrate the problem:
unhash the bdev_inode of /dev/sda
process A (blkdev_open()):
bd_acquire() returns new bdev_inode of /dev/sda
get_gendisk() returns gendisk (v1 gendisk)
remove gendisk from bdev_map
device number is reused
process B (blkdev_open()):
bd_acquire() returns new bdev_inode of /dev/sda
get_gendisk() returns a new gendisk (v2 gendisk)
increase bdev->bd_openers
process A (blkdev_open()):
find bdev->bd_openers != 0
put_disk(disk) // this is the last reference to v1 gendisk
disk_unblock_events(disk) // use-after-free occurs
The problem can be fixed by an extra check showed in the following snippet:
static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
if (!bdev->bd_openers) {
} else {
if (bdev->bd_disk != disk) {
ret = -ENXIO;
goto out_unlock_bdev;
}
}
}
And we also can simplify the check in your patch by moving
blk_unregister_region()
before bdev_unhash_inode(), so whenever we get a new bdev_inode, the gendisk
returned
by get_gendisk() will either be NULL or a new gendisk. And the only check we
need to do
in __blkdev_get() is checking the hash status of the bdev_inode after we get a
gendisk
from get_gendisk(). The check can be done like the following snippet:
static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
/* ...... */
ret = -ENXIO;
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
owner = disk->fops->owner;
if (inode_unhashed(bdev->bd_inode))
goto out_unmatched;
}
The second problem lies in bd_start_claiming(), and will trigger the BUG_ON
assert in
blkdev_get: BUG_ON(!bd_may_claim(bdev, whole, holder)). It occurs when testing
the
exclusive open and close of the disk and its partitions.
The cause of the problem is that a bdev_inode of a disk partition corresponds
with two
instances of bdev_inode of the whole disk simultaneously. When one pair of the
partition
inode and disk inode completes the claiming, the other pair will be stumbled by
the BUG_ON
assert in blkdev_get() because bdev->bd_holder is no longer NULL.
The following sequences illustrate the problem:
unhash the bdev_inode of /dev/sda2
process A (blkdev_open()):
bd_acquire() returns a new bdev_inode for /dev/sda2
bd_start_claiming() returns bdev_inode of /dev/sda
process B (blkdev_open()):
bd_acquire() returns the new bdev_inode for /dev/sda2
unhash the bdev_inode of /dev/sda
remove gendisk from bdev_map
device number is reused
process B (blkdev_open()):
bd_start_claming() returns a new bdev_inode for /dev/sda
process A (blkdev_open()):
__blkdev_get() returns successfully
finish claiming // bdev->bd_holder = holder
process B (blkdev_open()):
__blkdev_get() returns successfully
trigger BUG_ON(!bd_may_claim(bdev, whole, holder))
And the problem can be fixed by adding more checks in bd_start_claiming():
static struct block_device *bd_start_claiming(struct block_device *bdev,
void *holder)
{
/* ...... */
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
return ERR_PTR(-ENXIO);
if (inode_unhashed(bdev->bd_inode)) {
module_put(disk->fops->owner);
put_disk(disk);
return ERR_PTR(-ENXIO);
}
/* ...... */
if (!whole)
return ERR_PTR(-ENOMEM);
if (inode_unhashed(bdev->bd_inode) ||
inode_unhashed(whole->bd_inode)) {
bdput(whole);
return ERR_PTR(-ENXIO);
}
}
Except adding more and more checks, are there better solutions to the race
problem ?
I have thought about managing the device number by ref-counter, so the device
number
will not be reused until the last reference of bdev_inode and gendisk are
released,
and i am trying to figure out a way to get/put a reference of the device number
when
get/put the block_device.
So any suggests and thoughts ?
Regards,
Tao
---
1: https://marc.info/?l=linux-block&m=148554717109098&w=2
> Also add a warning that will tell us about unexpected inconsistencies
> between bdi associated with the bdev inode and bdi associated with the
> disk.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/block_dev.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 53e2389ae4d4..5ec8750f5332 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1560,7 +1560,8 @@ static int __blkdev_get(struct block_device *bdev,
> fmode_t mode, int for_part)
> if (!partno) {
> ret = -ENXIO;
> bdev->bd_part = disk_get_part(disk, partno);
> - if (!bdev->bd_part)
> + if (!(disk->flags & GENHD_FL_UP) || !bdev->bd_part ||
> + inode_unhashed(bdev->bd_inode))
> goto out_clear;
>
> ret = 0;
> @@ -1614,7 +1615,8 @@ static int __blkdev_get(struct block_device *bdev,
> fmode_t mode, int for_part)
> bdev->bd_contains = whole;
> bdev->bd_part = disk_get_part(disk, partno);
> if (!(disk->flags & GENHD_FL_UP) ||
> - !bdev->bd_part || !bdev->bd_part->nr_sects) {
> + !bdev->bd_part || !bdev->bd_part->nr_sects ||
> + inode_unhashed(bdev->bd_inode)) {
> ret = -ENXIO;
> goto out_clear;
> }
> @@ -1623,6 +1625,9 @@ static int __blkdev_get(struct block_device *bdev,
> fmode_t mode, int for_part)
>
> if (bdev->bd_bdi == &noop_backing_dev_info)
> bdev->bd_bdi = bdi_get(disk->queue->backing_dev_info);
> + else
> + WARN_ON_ONCE(bdev->bd_bdi !=
> + disk->queue->backing_dev_info);
> } else {
> if (bdev->bd_contains == bdev) {
> ret = 0;
>