> On Jan 22, 2018, at 18:34, Jens Axboe <[email protected]> wrote:
> 
> On 1/22/18 4:31 PM, David Zarzycki wrote:
>> Hello,
>> 
>> I previously reported a hang when building LLVM+clang on a block multi-queue 
>> device (NVMe _or_ loopback onto tmpfs with the ’none’ scheduler).
>> 
>> I’ve since updated the kernel to 4.15-rc9, merged the ‘blkmq/for-next’ 
>> branch, disabled nohz_full parameter (used for testing), and tried again. 
>> Both NVMe and loopback now lock up hard (ext4 if it matters). Here are the 
>> backtraces:
>> 
>> NVMe:      http://znu.io/IMG_0366.jpg
>> Loopback:  http://znu.io/IMG_0367.jpg
> 
> I tried to reproduce this today using the exact recipe that you provide,
> but it ran fine for hours. Similar setup, nvme on a dual socket box
> with 48 threads.

Hi Jens,

Thanks for the quick reply and thanks for trying to reproduce this. I’m not 
sure if this makes a difference, but this dual Skylake machine has 96 threads, 
not 48 threads. Also, just to be clear, NVMe doesn’t seem to matter. I hit this 
bug with a tmpfs loopback device set up like so:

dd if=/dev/zero bs=1024k count=10000 of=/tmp/loopdisk
losetup /dev/loop0 /tmp/loopdisk
echo none > /sys/block/loop0/queue/scheduler
mkfs -t ext4 -L loopy /dev/loop0
mount /dev/loop0 /l
### build LLVM+clang in /l
### 'ninja check-all' in a loop in /l

(No swap is set up because the machine has 192 GiB of RAM.)
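
For anyone reproducing this, here is a minimal sketch for confirming that
'none' really is the active scheduler after the echo above. The kernel marks
the active scheduler in that sysfs file with square brackets (for example
"[none] mq-deadline"). The helper name active_sched is made up for
illustration and is not part of any tool:

```shell
# Print the active I/O scheduler from a sysfs-style scheduler file.
# The active entry is the one in square brackets, e.g. "[none] mq-deadline".
active_sched() {
    # Keep only the text between the brackets; print nothing if none found.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Hypothetical usage on the machine under test:
#   active_sched /sys/block/loop0/queue/scheduler
```

If this prints anything other than "none" for loop0, the reproduction is not
exercising the same configuration described here.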

> 
>> What should I try next to help debug this?
> 
> This one looks different than the other one. Are you sure your hw is sane?

I can build LLVM+clang in /tmp (tmpfs) reliably, which suggests that the 
fundamental hardware is sane. It’s only when the software multi-queue layer 
gets involved that I see quick crashes/hangs.

As for the different backtraces, that's probably because I removed nohz_full 
from the kernel boot parameters.

> I'd probably try and enable lockdep debugging etc and see if you catch 
> anything.

Thanks. I turned on lockdep plus other lock debugging. Here is the resulting 
backtrace:

http://znu.io/IMG_0368.jpg
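
For anyone repeating this, a rough sketch of checking that lock debugging is
actually enabled in a kernel config file. The option names in the usage
comment are an assumption (commonly used lockdep-related options), not
necessarily the exact set enabled for the backtrace above, and the helper
name missing_opts is hypothetical:

```shell
# Print each requested option that is not set to "y" in the given config file.
missing_opts() {
    cfg=$1; shift
    for opt in "$@"; do
        # An enabled built-in option appears as a line of the form OPTION=y.
        grep -q "^${opt}=y\$" "$cfg" || echo "$opt"
    done
}

# Hypothetical usage; the config file location varies by distro:
#   missing_opts "/boot/config-$(uname -r)" \
#       CONFIG_PROVE_LOCKING CONFIG_DEBUG_SPINLOCK CONFIG_DEBUG_MUTEXES
```

Empty output means every listed option is enabled.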

Here is the resulting backtrace with transparent huge pages disabled:

http://znu.io/IMG_0369.jpg

Here is the resulting backtrace with transparent huge pages disabled AND with 
systemd-coredumps disabled too:

http://znu.io/IMG_0370.jpg

I’m open to trying anything at this point. Thanks for helping,
Dave
