Hi Ilya,

It turns out that sgdisk 0.8.6 -i 2 /dev/vdb removes partitions and re-adds
them on CentOS 7 with a 3.10.0-229.11.1.el7 kernel, in the same way partprobe
does. ceph-disk runs this command intensively, which inevitably leads to races
where a device temporarily disappears. The same command (sgdisk 0.8.8) on
Ubuntu 14.04 with a 3.13.0-62-generic kernel only generates two udev change
events and does not remove / add partitions. The source code did not change in
a significant way between sgdisk 0.8.6 and sgdisk 0.8.8, and the output of
strace -e ioctl sgdisk -i 2 /dev/vdb is identical in both environments:

ioctl(3, BLKGETSIZE, 20971520)          = 0
ioctl(3, BLKGETSIZE64, 10737418240)     = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
ioctl(3, BLKGETSIZE, 20971520)          = 0
ioctl(3, BLKGETSIZE64, 10737418240)     = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKGETSIZE, 20971520)          = 0
ioctl(3, BLKGETSIZE64, 10737418240)     = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, BLKSSZGET, 512)                = 0

This leads me to the conclusion that the difference is in how the kernel reacts
to these ioctls.
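
As a rough cross-check (a sketch, not part of the original investigation), the
trace can be scanned for ioctls that explicitly ask the kernel to re-read or
modify the partition table. The trace above contains only BLKGETSIZE,
BLKGETSIZE64, BLKSSZGET and HDIO_GETGEO, which are read-only queries. The
helper names below (count_ioctls, partition_ioctls_seen) and the choice of
BLKRRPART and BLKPG as the ioctls to look for are assumptions made for
illustration; the sample is a shortened copy of the trace.

```python
import re
from collections import Counter

# Assumption: these are the ioctls of interest, i.e. the ones that request
# a partition table re-read (BLKRRPART) or add/delete partitions (BLKPG).
PARTITION_IOCTLS = {"BLKRRPART", "BLKPG"}

# Matches the request name in a line such as "ioctl(3, BLKSSZGET, 512) = 0".
IOCTL_RE = re.compile(r"^ioctl\(\d+,\s*([A-Z0-9_]+)")


def count_ioctls(trace):
    """Count how many times each ioctl request appears in strace output."""
    counts = Counter()
    for line in trace.splitlines():
        match = IOCTL_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


def partition_ioctls_seen(trace):
    """Return the sorted partition-related ioctls that appear in the trace."""
    return sorted(PARTITION_IOCTLS & set(count_ioctls(trace)))


# Shortened sample taken from the trace in this message.
SAMPLE = """\
ioctl(3, BLKGETSIZE, 20971520)          = 0
ioctl(3, BLKGETSIZE64, 10737418240)     = 0
ioctl(3, BLKSSZGET, 512)                = 0
ioctl(3, HDIO_GETGEO, {heads=16, sectors=63, cylinders=16383, start=0}) = 0
"""

if __name__ == "__main__":
    print(dict(count_ioctls(SAMPLE)))
    print(partition_ioctls_seen(SAMPLE))
```

An empty list from partition_ioctls_seen only says that sgdisk did not request
a re-read through these two ioctls; it does not rule out other triggers.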

What do you think?

Cheers

On 17/12/2015 17:26, Ilya Dryomov wrote:
> On Thu, Dec 17, 2015 at 3:10 PM, Loic Dachary <l...@dachary.org> wrote:
>> Hi Sage,
>>
>> On 17/12/2015 14:31, Sage Weil wrote:
>>> On Thu, 17 Dec 2015, Loic Dachary wrote:
>>>> Hi Ilya,
>>>>
>>>> This is another puzzling behavior (the log of all commands is at
>>>> http://tracker.ceph.com/issues/14094#note-4). In a nutshell, after a
>>>> series of sgdisk -i commands to examine various devices including
>>>> /dev/sdc1, the /dev/sdc1 file disappears (and I think it will show up
>>>> again, although I don't have definitive proof of this).
>>>>
>>>> It looks like a side effect of a previous partprobe command, the only
>>>> command I can think of that removes / re-adds devices. I thought calling
>>>> udevadm settle after running partprobe would be enough to ensure
>>>> partprobe completed (and since it takes as much as 2mn30 to return, I
>>>> would be shocked if it does not ;-).
> 
> Yeah, IIRC partprobe goes through every slot in the partition table,
> trying to first remove and then add the partition back.  But, I don't
> see any mention of partprobe in the log you referred to.
> 
> Should udevadm settle for a few vd* devices be taking that much time?
> I'd investigate that regardless of the issue at hand.
> 
>>>>
>>>> Any ideas? I am desperately trying to find a consistent behavior,
>>>> something reliable that we could use to say: "wait for the partition
>>>> table to be up to date in the kernel and all udev events generated by
>>>> the partition table update to complete".
>>>
>>> I wonder if the underlying issue is that we shouldn't be calling udevadm
>>> settle from something running from udev.  Instead, if a udev-triggered
>>> run of ceph-disk does something that changes the partitions, it
>>> should just exit and let udevadm run ceph-disk again on the new
>>> devices...?
> 
>>
>> Unless I missed something, this is on CentOS 7, and ceph-disk is only
>> called from udev as ceph-disk trigger, which does nothing but
>> asynchronously delegate the work to systemd. Therefore there is no udevadm
>> settle from within udev (which would deadlock and time out every time... I
>> hope ;-).
> 
> That's a sure lockup, until one of them times out.
> 
> How are you delegating to systemd?  Is it to avoid long-running udev
> events?  I'm probably missing something - udevadm settle wouldn't block
> on anything other than udev, so if you are shipping work off to
> somewhere else, udev can't be relied upon for waiting.
> 
> Thanks,
> 
>                 Ilya
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
