I tracked down a source of Git corrupting repositories to lock file
design not being robust when containers / PID namespaces are present.

In my case, the corruption stemmed from premature release of the `git
gc` lock in the gc.pid file. But since the lock file code for that file
is in gc.c, there could be other lock files in Git affected by the same
design limitation as well.

The lock design of gc.pid stores the current hostname and PID of the
locking process in the file. If another process comes along and its
hostname matches the stored hostname, it checks to see if the listed PID
exists. If the PID is missing, it assumes the lock is stale and releases
the lock.

A limitation with this approach is it isn't robust in the presence of
containers / PID namespaces. In containers, it is common for the
hostname to match the container host's hostname. Or the hostname will be
static string. In Kubernetes, all containers within a pod share the same
hostname. Containers (almost always) run in separate PID namespaces, so
PIDs from outside the container aren't visible to the container itself.
This means that if e.g. 2 `git gc` processes are running with the same
hostname in separate containers / PID namespaces, Git could prematurely
release the lock file because it thinks the "other" PID is dead and repo
corruption could ensue due to the 2 `git gc` processes racing with each
other.

The on-disk format of lock files obviously needs to be backwards
compatible with older clients. One backwards compatible solution is to
append something to the hostname to disambiguate containers / PID
namespaces. Mercurial appends the current PID namespace identifier to
the hostname [1] and my experience is that this is sufficient to
mitigate the issue. It is possible more robust solutions are achievable.

Gregory

[1] https://www.mercurial-scm.org/repo/hg/rev/1f151a33af8e

Reply via email to