On 18/02/2017 03:48, Peter Karman wrote:
> We already write the PID to Lucy lock files, so we can check if the process
> that created the lock is still running, yes?
> Or is that the very heuristic that you're wanting to move away from?
Yes, because it's unreliable. We don't detect whether another unrelated
process happens to reuse the PID. For example:
- We have an Indexer with PID 42.
- The machine crashes during indexing, leaving a lockfile with PID 42.
- The machine restarts and happens to assign PID 42 to another process
before an Indexer runs.
- Any new Indexer will be locked out as long as this other process is
running.
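To make the failure mode concrete, here is a minimal sketch of the kind of liveness check a PID-based lock relies on. It is not Lucy's actual code. `pid_is_running` is a hypothetical helper, and using `kill(pid, 0)` is an assumption about how such a check is typically done on POSIX systems.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not Lucy's API): reports whether *some* process
 * with this PID currently exists. kill() with signal 0 performs only the
 * existence and permission checks; no signal is delivered. */
bool pid_is_running(pid_t pid) {
    if (pid <= 0) {
        return false;  /* 0 and negative values address process groups */
    }
    if (kill(pid, 0) == 0) {
        return true;
    }
    /* EPERM: the process exists but belongs to another user.
     * ESRCH: no such process. */
    return errno == EPERM;
}

/* Limitation: a true result says nothing about *which* program owns the
 * PID. After a reboot or PID wrap-around it may be an unrelated process,
 * so a lock file containing that PID never looks stale. */
```

In the scenario above, a check like this keeps returning true for as long as the unrelated process holding PID 42 is alive, which is exactly why new Indexers stay locked out.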
Right now, the only remedy is to delete the lock file manually. Fortunately,
this scenario is unlikely if only the indexing process terminates abnormally,
because PIDs are typically not reused until the PID counter wraps around. Even
after a system crash, there's a good chance that an Indexer is started before
the old PID is reused.
On a shared volume like NFS, this problem is more pronounced. A single machine
that goes down or loses its network connection at the wrong moment will block
all other indexers until it comes back up and starts an indexing session.
Native locks are released by the operating system if a process crashes. This
even works on NFS after a certain timeout with modern client and server
implementations. Beyond that, native shared locks should be faster on NFS.
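For comparison, a minimal sketch of a native lock using POSIX `fcntl()` record locks. `acquire_native_lock` and `release_native_lock` are hypothetical helpers, not part of Lucy, and choosing `fcntl()` over `flock()` or another mechanism is an assumption here. `F_RDLCK` gives a shared lock and `F_WRLCK` an exclusive one. Either way, the kernel drops the lock when the holding process exits, so no stale state is left for a later Indexer to interpret.

```c
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: try to take a non-blocking lock on the whole file
 * at `path`, shared (read) or exclusive (write). Returns the open file
 * descriptor on success, or -1 if the file can't be opened or a
 * conflicting lock is held by another process. The lock lasts while the
 * descriptor is open and the process is alive; the kernel releases it
 * when the process exits, including after a crash. */
int acquire_native_lock(const char *path, bool shared) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }
    struct flock fl = {0};
    fl.l_type = shared ? F_RDLCK : F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;  /* 0 means "through end of file", i.e. the whole file */
    if (fcntl(fd, F_SETLK, &fl) == -1) {
        close(fd);  /* EACCES or EAGAIN: another process holds a conflicting lock */
        return -1;
    }
    return fd;
}

/* Closing the descriptor releases the lock. */
void release_native_lock(int fd) {
    close(fd);
}
```

One design caveat with POSIX record locks: they exclude other processes, not other threads or code paths in the same process, and closing *any* descriptor the process has open on that file releases all of its locks on it. A real implementation would need to keep a single descriptor per lock file per process.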
Nick