On 11 Aug 2010, at 17:21, Simon Wilkinson wrote:
>
> Once you've applied this, I would be interested to know what error your test
> now returns ...
I'm still interested in the error code you're seeing, but on further analysis,
I think I've identified two problems. They're both related to race conditions
in the way that we enrol AFS locks with the kernel's local lock management
system (we do this so that the kernel can handle byte-range locks on the local
machine for us).
The first is that locks and unlocks can race against each other. On a lock we
do SetAFSLock, SetKernelLock. On unlock we do ReleaseAFSLock,
ReleaseKernelLock. However, we don't hold any locks on the file whilst we do
so. Multiple calls to set a lock are safe, as the SetAFSLock serialises them.
However, a lock and an unlock may race each other. In this case we have

    Process A               Process B
    SetAFSLock
    SetKernelLock
    ....
    ReleaseAFSLock
                            SetAFSLock
                            SetKernelLock
    ReleaseKernelLock

Process B can't get the kernel lock, despite the fact that it has the AFS lock,
because process A hasn't released it yet. So you get an error message.
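
To make that ordering concrete, here is a rough, single-threaded replay of it in Python. It is only a toy model: ToyFile and its methods are hypothetical stand-ins for the four steps named above, both locks are assumed to be exclusive and non-blocking, and none of it is taken from the OpenAFS sources.

```python
# Toy, single-threaded replay of the interleaving above.
# Every name here is a hypothetical stand-in, not an OpenAFS function.

class ToyFile:
    """One file with an exclusive 'AFS' lock and an exclusive 'kernel' lock."""

    def __init__(self):
        self.afs_owner = None      # holder of the modelled AFS lock
        self.kernel_owner = None   # holder of the modelled kernel lock

    def set_afs_lock(self, who):
        # Succeeds only if nobody holds the modelled AFS lock.
        if self.afs_owner is None:
            self.afs_owner = who
            return True
        return False

    def set_kernel_lock(self, who):
        # Succeeds only if nobody holds the modelled kernel lock.
        if self.kernel_owner is None:
            self.kernel_owner = who
            return True
        return False

    def release_afs_lock(self, who):
        if self.afs_owner == who:
            self.afs_owner = None

    def release_kernel_lock(self, who):
        if self.kernel_owner == who:
            self.kernel_owner = None


def replay_race():
    """Run the steps in the order shown above and record each lock attempt."""
    f = ToyFile()
    results = {}
    results["A SetAFSLock"] = f.set_afs_lock("A")
    results["A SetKernelLock"] = f.set_kernel_lock("A")
    f.release_afs_lock("A")                        # A is half-way through unlock
    results["B SetAFSLock"] = f.set_afs_lock("B")  # AFS lock is free again
    results["B SetKernelLock"] = f.set_kernel_lock("B")  # A still holds this one
    f.release_kernel_lock("A")                     # too late for B
    return results


if __name__ == "__main__":
    for step, ok in replay_race().items():
        print(step, "ok" if ok else "FAILED")
```

In this model B's SetAFSLock succeeds and its SetKernelLock fails, which is the error case described above.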
The second problem is a similar race, but related to what happens when we close
a file handle. We don't actually clean up any of the kernel file locks
ourselves - instead, we let the kernel do so when it disposes of the file
descriptor. However, we do release any fileserver locks that we might have.
Between us releasing the fileserver locks, and the kernel freeing its locks,
there's an opportunity for another process to gain a fileserver lock, but not a
local one, and you'll get an error back there.
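
The window on close can be sketched the same way. Again this is a toy model under stated assumptions: the names are hypothetical, locks are exclusive and non-blocking, and the back-out of the fileserver lock on failure is something the model does for tidiness, not a statement about what the client does.

```python
# Toy model of the close-time window described above.
# Names are hypothetical; this is not OpenAFS or Linux kernel code.

class ToyLockState:
    def __init__(self):
        self.fileserver_owner = None   # modelled fileserver lock holder
        self.kernel_owner = None       # modelled local (kernel) lock holder


def try_lock(state, who):
    """Take the fileserver lock, then the kernel lock.

    Returns None on success, or a short error string.
    """
    if state.fileserver_owner is not None:
        return "fileserver lock busy"
    state.fileserver_owner = who
    if state.kernel_owner is not None:
        # Assumption of this toy model only: back out the fileserver lock.
        state.fileserver_owner = None
        return "kernel lock busy"
    state.kernel_owner = who
    return None


def close_release_fileserver(state, who):
    """First half of close: we drop our fileserver locks ourselves."""
    if state.fileserver_owner == who:
        state.fileserver_owner = None


def close_kernel_cleanup(state, who):
    """Second half of close: the kernel frees its locks with the descriptor."""
    if state.kernel_owner == who:
        state.kernel_owner = None


def replay_close_window():
    """Return (B's result inside the window, B's result after it)."""
    state = ToyLockState()
    assert try_lock(state, "A") is None       # A holds both locks
    close_release_fileserver(state, "A")      # the window opens here
    inside = try_lock(state, "B")             # fileserver ok, kernel not
    close_kernel_cleanup(state, "A")          # the window closes here
    after = try_lock(state, "B")
    return inside, after


if __name__ == "__main__":
    print(replay_close_window())
```

In the model, B fails with the kernel-lock error only if it arrives between the two halves of A's close. Once the kernel cleanup has run, the same attempt succeeds.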
I think that it's the second problem that your test is hitting. Sadly this
problem is the harder one to fix, as it requires refactoring the way that we
interface with the Linux lock management code.
Cheers,
Simon.