Hi, sorry for mixing up the years. lslocks only showed make processes locking /dev/null, but the culprit appears to be a running dockerd daemon. I don't understand why, but with the service disabled a blocked make suddenly continues.
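In case it helps anyone reading along, the mechanism itself can be shown in isolation: if one process holds a write lock on a file, a second process asking for the same lock with a blocking request (F_SETLKW, as in the gdb backtrace quoted below) just waits. What follows is only a rough sketch in Python, not code from make or dockerd. It assumes Linux, uses Python's fcntl.lockf() as a stand-in for make's fcntl() call, and the names lock_is_contended and _CHILD are made up for this example. The child uses a non-blocking request so that the demo reports the conflict instead of hanging.

```python
import fcntl
import os
import subprocess
import sys

# Hypothetical helper script run in a second process: it tries to take
# a write lock on the given path without waiting. Exit status 1 means
# the lock is currently held by another process.
_CHILD = """
import fcntl, os, sys
fd = os.open(sys.argv[1], os.O_WRONLY)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(1)
sys.exit(0)
"""


def lock_is_contended(path):
    """Hold a write lock on path and report whether a second process
    is refused the same lock (illustrative name, not from make)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # Blocking request for a whole-file write lock in this process.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        # A separate process is needed: these locks are owned per process.
        result = subprocess.run([sys.executable, "-c", _CHILD, path])
        return result.returncode == 1
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

Called as lock_is_contended("/dev/null") this should return True on a machine where nothing else locks /dev/null. If some other process already holds a conflicting lock there, the call would itself block in lockf(), much like the hung make.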
To install the service:

echo > /etc/apt/sources.list.d/docker.list 'deb [arch=amd64] https://apt.dockerproject.org/repo/ debian-stretch main'
apt-get update; apt-get install docker-engine

For completeness, the lslocks output:

$ lslocks
COMMAND           PID TYPE     SIZE MODE  M      START        END PATH
zeitgeist-fts    1685 POSIX   15.2M READ  0 1073741826 1073742335 /home/noppl/.local/share/zeitgeist/activity.sqlite
zeitgeist-fts    1685 POSIX     32K READ  0        128        128 /home/noppl/.local/share/zeitgeist/activity.sqlite-shm
chromium         1872 POSIX      0B WRITE 0          0          0 /home/noppl/.config/chromium/Default/data_reduction_proxy_leveldb/LOCK
chromium         1872 POSIX   16.7M WRITE 0 1073741824 1073742335 /home/noppl/.config/chromium/Default/History
atd               742 POSIX      4B WRITE 0          0          0 /run/atd.pid
tracker-store    1609 POSIX  256.5M READ  0 1073741826 1073742335 /home/noppl/.cache/tracker/meta.db
tracker-store    1609 POSIX     32K READ  0        128        128 /home/noppl/.cache/tracker/meta.db-shm
zeitgeist-datah  1593 POSIX   15.2M READ  0 1073741826 1073742335 /home/noppl/.local/share/zeitgeist/activity.sqlite
zeitgeist-datah  1593 POSIX     32K READ  0        128        128 /home/noppl/.local/share/zeitgeist/activity.sqlite-shm
chromium         1872 POSIX      0B WRITE 0          0          0 /home/noppl/.config/chromium/Default/Service Worker/Database/LOCK
chromium         1872 POSIX      0B WRITE 0          0          0 /home/noppl/.config/chromium/Default/Session Storage/LOCK
libvirtd          955 POSIX      3B WRITE 0          0          0 /run/libvirtd.pid
chromium         1872 POSIX      0B WRITE 0          0          0 /home/noppl/.config/chromium/Default/GCM Store/LOCK
zeitgeist-fts    1685 OFDLCK     0B WRITE 0          0          0 /home/noppl/.local/share/zeitgeist/fts.index/flintlock
chromium         1872 POSIX    352K WRITE 0 1073741824 1073742335 /home/noppl/.config/chromium/Default/Web Data
chromium         1872 POSIX    3.6M WRITE 0 1073741824 1073742335 /home/noppl/.config/chromium/Default/Sync Data/SyncData.sqlite3
cron              728 FLOCK      4B WRITE 0          0          0 /run/crond.pid
chromium         1872 POSIX    124K WRITE 0 1073741824 1073742335 /home/noppl/.config/chromium/Default/Login Data
chromium         1872 POSIX   13.6M READ  0 1073741826 1073742335 /home/noppl/.config/chromium/Default/Favicons
chromium         1872 POSIX      0B WRITE 0          0          0 /home/noppl/.config/chromium/Default/Extension State/LOCK
chromium         1872 POSIX      0B WRITE 0          0          0 /home/noppl/.config/chromium/Default/File System/041/t/Paths/LOCK
rpcbind           689 FLOCK      0B WRITE 0          0          0 /run/rpcbind.lock
zeitgeist-daemo  1651 POSIX   15.2M READ  0 1073741826 1073742335 /home/noppl/.local/share/zeitgeist/activity.sqlite
zeitgeist-daemo  1651 POSIX     32K READ  0        128        128 /home/noppl/.local/share/zeitgeist/activity.sqlite-shm
chromium         1872 POSIX      0B WRITE 0          0          0 /home/noppl/.config/chromium/Default/File System/Origins/LOCK
chromium         1872 POSIX    736K WRITE 0 1073741824 1073742335 /home/noppl/.config/chromium/Default/Shortcuts
dockerd          3732 OFDLCK        READ  0          0          0 /dev...
dockerd          3732 FLOCK    128K WRITE 0          0          0 /var/lib/docker/volumes/metadata.db

2017-02-18 1:34 GMT+01:00 James Cowgill <jcowg...@debian.org>:
> Hi,
>
> On 17/02/17 18:08, Norbert Lange wrote:
>> Hello,
>>
>> Tried reproducing it at work (where it first happened on a build server).
>> On my PC at home with 4 cores / 12 thread the bug reproduces always
>> On a 6 core / 12 threads Xeon Server the bug reproduces always
>> On my work PC with 4 cores / 4 threads running in a VMware Instance it
>> doesnt reproduce.
>> All running Debian Stretch with current updates.
>>
>> Maybe you want to add infos about your system?
>> From the sample of 3: Hyperthreading or >= 8 threads or runnin on bare
>> metal instead of in a VM could provoke the bug.
>
> Originally the system I tried it on has 8 cores (can't remember number
> of threads), but I tried it on machines with 2 cores and one with 16 and
> it worked on all of them. I don't think the number of cores is relevant
> here.
>
>> Further make 4.1 was uploaded to Debian Stretch on 16h january, the
>> issue appeared on 19th january on the server.
>> So disregard what I said about this not being an upstream issue - its
>> actually quite possible.
>
> Have you muddled years up here?
> 4.1 was uploaded on 16th Jan *2016*.
>
>> Heres a dump via attached gdb (step wont do anything so it seems that
>> the thread is blocked):
>>
>> (gdb) thread apply all bt
>>
>> Thread 1 (process 12177):
>> #0 0x00007f476c156962 in do_fcntl (fd=1, cmd=7, arg=0x5595eae95ea0)
>> at ../sysdeps/unix/sysv/linux/fcntl.c:31
>
> This is fcntl(stdout = /dev/null, F_SETLKW, <struct flock>)
>
> It seems that "make -O" attempts to lock stdout before writing to it so
> that multiple make processes can cooperate on who gets to write any
> output. If it's hanging here, then someone must already be holding the lock.
>
> Please can you give the output of "lslocks" on the machines that fail.
> There might be an entry for /dev/null which will point at the culprit.
> Failing that, an "strace -f" would be useful so we can see all the calls
> made to fcntl.
>
>> I`ll have to compile make with debuginfo if you need more (gonna take
>> a few days)
>
> I don't need any debug information, but you may be interested in this:
> https://wiki.debian.org/AutomaticDebugPackages
>
> So if you add this apt source:
> deb http://deb.debian.org/debian-debug/ unstable-debug main
>
> You can then install make-dbgsym to get the debug symbols for make
> without recompiling anything.
>
> Thanks,
> James
>
>> 2017-02-17 15:24 GMT+01:00 James Cowgill <jcowg...@debian.org>:
>>> On 16/02/17 21:52, Norbert Lange wrote:
>>>> Package: make
>>>> Version: 4.1-9
>>>> Severity: important
>>>>
>>>> Dear Maintainer,
>>>>
>>>> running the attached Makefile will hang the process,
>>>> if multiple jobs are used then the process wont respond to a
>>>> TERM and has to be killed.
>>>>
>>>> The very same issue is observed with make-guile.
>>>>
>>>> I believe this to not be an upstream bug, since I observed this
>>>> only a couple weeks ago after an upgrade.
>>>> Unfortunatly I can`t pinpoint a date or version.
>>>
>>> I cannot reproduce this bug.
>>>
>>> Also, make has not been updated in testing for almost a year so if it
>>> only started happening recently, something else probably caused it.
>>>
>>> Running 'make -d -O' (although this may be difficult if the bug requires
>>> redirection to /dev/null) or the output or running make inside gdb and
>>> finding where it hangs might help in diagnosing this.
>>>
>>> Thanks,
>>> James