Hi Jan,
On 2017/11/21 0:43, Jan Kara wrote:
> Hi Tao!
>
> On Fri 17-11-17 14:51:18, Hou Tao wrote:
>> On 2017/3/13 23:14, Jan Kara wrote:
>>> blkdev_open() may race with gendisk shutdown in two different ways.
>>> Either del_gendisk() has already unhashed block device inode (and thus
>>> bd_acquire() will end up creating new block device inode) however
>>> get_gendisk() will still return the gendisk that is being destroyed.
>>> Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
>>> before we get to get_gendisk() and get_gendisk() will return new gendisk
>>> that got allocated for device that reused the device number.
>>>
>>> In both cases this will result in possible inconsistencies between
>>> bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
>>> destroyed and device number reused, in the second case immediately).
>>
>>> Fix the problem by checking whether the gendisk is still alive and inode
>>> hashed when associating bdev inode with it and its bdi. That way we are
>>> sure that we will not associate bdev inode with disk that got past
>>> blk_unregister_region() in del_gendisk() (and thus device number can get
>>> reused). Similarly, we will not associate bdev that was once associated
>>> with gendisk that is going away (and thus the corresponding bdev inode
>>> will get unhashed in del_gendisk()) with a new gendisk that is just
>>> reusing the device number.
>>
>> I have run some extended tests based on Omar Sandoval's script [1] these
>> days, and found two new problems related to the race between bdev open
>> and gendisk shutdown.
>>
>> The first problem lies in __blkdev_get(), and will cause a use-after-free
>> or, worse, an oops. It happens because two instances of gendisk may be
>> related to one bdev. When the process which owns the newer gendisk
>> completes the invocation of __blkdev_get() first, the other process,
>> which owns the older gendisk, will put the last reference of the older
>> gendisk and cause a use-after-free.
>>
>> The following call sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda
>>
>> process A (blkdev_open()):
>> bd_acquire() returns new bdev_inode of /dev/sda
>> get_gendisk() returns gendisk (v1 gendisk)
>>
>> remove gendisk from bdev_map
>> device number is reused
>
> ^^ this is through blk_unregister_region() in del_gendisk(), isn't it?
Yes, both the unhashing of the inode and the removal of the gendisk from
bdev_map happen in del_gendisk().
>> process B (blkdev_open()):
>> bd_acquire() returns new bdev_inode of /dev/sda
>> get_gendisk() returns a new gendisk (v2 gendisk)
>> increase bdev->bd_openers
>
> So this needs to be a bit more complex - bd_acquire() must happen before
> bdev_unhash_inode() in del_gendisk() above happened. And get_gendisk() must
> happen after blk_unregister_region() from del_gendisk() happened. But yes,
> this seems possible.
>
>> process A (blkdev_open()):
>> find bdev->bd_openers != 0
>> put_disk(disk) // this is the last reference to v1 gendisk
>> disk_unblock_events(disk) // use-after-free occurs
>>
>> The problem can be fixed by an extra check, as shown in the following
>> snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	if (!bdev->bd_openers) {
>> 		/* ...... */
>> 	} else {
>> 		/* the gendisk from get_gendisk() must match the one that
>> 		 * is already attached to the opened bdev */
>> 		if (bdev->bd_disk != disk) {
>> 			ret = -ENXIO;
>> 			goto out_unlock_bdev;
>> 		}
>> 	}
>> }
>>
>> And we can also simplify the check in your patch by moving
>> blk_unregister_region() before bdev_unhash_inode(), so that whenever we
>> get a new bdev_inode, the gendisk returned by get_gendisk() will be
>> either NULL or a new gendisk.
>
> I don't think we can do that. If we call blk_unregister_region() before
> bdev_unhash_inode(), the device number can get reused while the bdev inode
> is still visible. So you can get bdev inode v1 associated with gendisk
> v2. And an open following the unhashing of bdev inode v1 will get bdev
> inode v2 associated with gendisk v2. So you have two different bdev inodes
> associated with the same gendisk, and that will cause data integrity
> issues.
Even if we do not move blk_unregister_region(), the data integrity issue
still exists: bd_acquire() may be invoked before bdev_unhash_inode() and
get_gendisk() after blk_unregister_region(). The move is only meant to
simplify the consistency check (the version check) between bdev_inode and
gendisk: the check added above would become unnecessary, because whenever
bd_acquire() returns a new bdev_inode, the gendisk returned by
get_gendisk() will also be a new gendisk or NULL. Anyway, it's just
another workaround, so it's OK if we do not do that.
> But what you suggest above is probably a reasonable fix. Will you submit a
> proper patch please?
Well, it's also an incomplete fix for the race problem, but anyhow I will
do it.
>> And the only check we need to do in __blkdev_get() is to check the hash
>> status of the bdev_inode after we get the gendisk from get_gendisk().
>> The check can be done as in the following snippet:
>>
>> static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
>> {
>> 	/* ...... */
>> 	ret = -ENXIO;
>> 	disk = get_gendisk(bdev->bd_dev, &partno);
>> 	if (!disk)
>> 		goto out;
>> 	owner = disk->fops->owner;
>>
>> 	/* bail out if del_gendisk() unhashed the inode in the meantime;
>> 	 * the elided error path must drop the gendisk and module refs */
>> 	if (inode_unhashed(bdev->bd_inode))
>> 		goto out_unmatched;
>> 	/* ...... */
>> }
>>
>> The second problem lies in bd_start_claiming(), and will trigger the
>> BUG_ON() assertion in blkdev_get(): BUG_ON(!bd_may_claim(bdev, whole,
>> holder)). It occurs when testing exclusive open and close of the disk
>> and its partitions.
>>
>> The cause of the problem is that one bdev_inode of a disk partition
>> corresponds to two instances of the bdev_inode of the whole disk
>> simultaneously. When one pair of partition inode and whole-disk inode
>> completes the claiming, the other pair will stumble over the BUG_ON()
>> assertion in blkdev_get() because bdev->bd_holder is no longer NULL.
>>
>> The following sequences illustrate the problem:
>>
>> unhash the bdev_inode of /dev/sda2
>>
>> process A (blkdev_open()):
>> bd_acquire() returns a new bdev_inode for /dev/sda2
>> bd_start_claiming() returns bdev_inode of /dev/sda
>>
>> process B (blkdev_open()):
>> bd_acquire() returns the new bdev_inode for /dev/sda2
>>
>> unhash the bdev_inode of /dev/sda
>> remove gendisk from bdev_map
>> device number is reused
>>
>> process B (blkdev_open()):
>> bd_start_claiming() returns a new bdev_inode for /dev/sda
>>
>> process A (blkdev_open()):
>> __blkdev_get() returns successfully
>> finish claiming // bdev->bd_holder = holder
>>
>> process B (blkdev_open()):
>> __blkdev_get() returns successfully
>> trigger BUG_ON(!bd_may_claim(bdev, whole, holder))
>
> Ah, good spotting. Essentially the whole->bd_claiming protection for
> partition bdev does not work because we've got two different 'whole' bdev
> pointers.
Yes, perfect summary.
>> And the problem can be fixed by adding more checks in bd_start_claiming():
>>
>> static struct block_device *bd_start_claiming(struct block_device *bdev,
>> 					      void *holder)
>> {
>> 	/* ...... */
>> 	disk = get_gendisk(bdev->bd_dev, &partno);
>> 	if (!disk)
>> 		return ERR_PTR(-ENXIO);
>>
>> 	/* the bdev inode may have been unhashed by del_gendisk() after
>> 	 * bd_acquire() but before get_gendisk() */
>> 	if (inode_unhashed(bdev->bd_inode)) {
>> 		module_put(disk->fops->owner);
>> 		put_disk(disk);
>> 		return ERR_PTR(-ENXIO);
>> 	}
>>
>> 	/* ...... */
>> 	if (!whole)
>> 		return ERR_PTR(-ENOMEM);
>>
>> 	/* recheck both inodes: either may have been unhashed while we
>> 	 * were looking up 'whole' */
>> 	if (inode_unhashed(bdev->bd_inode) ||
>> 	    inode_unhashed(whole->bd_inode)) {
>> 		bdput(whole);
>> 		return ERR_PTR(-ENXIO);
>> 	}
>> }
>>
>> Apart from adding more and more checks, are there better solutions to
>> the race problem? I have thought about managing the device number with a
>> ref-counter, so the device number will not be reused until the last
>> references of the bdev_inode and the gendisk are released, and I am
>> trying to figure out a way to get/put a reference on the device number
>> when getting/putting the block_device.
>>
>> So, any suggestions or thoughts?
>
> Yeah, these races are nasty. I will think about whether there's some
> reasonably easy way to fix them or whether we'll have to go via some other
> route like blocking the device number as you suggest. In either case,
> thanks for the report and the analysis!
I'm looking forward to reviewing them.
Regards,
Tao
> Honza
>