Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
On Fri, Jun 26, 2020 at 08:01:18PM +0100, Steve McIntyre wrote: On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote: On 6/26/20 9:50 AM, Steve McIntyre wrote: OK, will try that second... I have not been able to reproduce but So looking at linux-4.19.y it looks like 1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check was picked without ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check Both of them are marked as Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") so I would expect them to be picked together. ptraceme is potentially updating the task's cred while the access check is running. Try building after picking ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check Bingo! With that one change the test suite runs to completion, no lockup. I've queued ca3fde5214e1 for 4.19 and 4.14, thank you for your work! -- Thanks, Sasha
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote: >On 6/26/20 9:50 AM, Steve McIntyre wrote: >> >> OK, will try that second... >> > >I have not been able to reproduce but > >So looking at linux-4.19.y it looks like >1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check > >was picked without >ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check > >Both of them are marked as >Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") > >so I would expect them to be picked together. > >ptraceme is potentially updating the task's cred while the access check is >running. > >Try building after picking >ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check Bingo! With that one change the test suite runs to completion, no lockup. \o/ Thanks guys, I think we've found the cause here. -- Steve McIntyre, Cambridge, UK.st...@einval.com "The whole problem with the world is that fools and fanatics are always so certain of themselves, and wiser people so full of doubts." -- Bertrand Russell
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
Hi again, On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote: >On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: >>On Fri, Jun 26, 2020 at 3:41 PM Greg KH wrote: >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > >... > >>> > Considering I'm running strace build tests to provoke this bug, >>> > finding the failure in a commit talking about ptrace changes does look >>> > very suspicious...! >>> > >>> > Annoyingly, I can't reproduce this on my disparate other machines >>> > here, suggesting it's maybe(?) timing related. >> >>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure >>prints a warning to dmesg? If so, can you share that warning? > >I mean the machine locks hard - X stops updating, the mouse/keyboard >stop responding. No pings, etc. When I reboot, there's nothing in the >logs. > >>If you don't have any way to see console output, and you don't have a >>working serial console setup or such, you may want to try re-running >>those tests while the kernel is booted with netconsole enabled to log >>to a different machine over UDP (see >>https://www.kernel.org/doc/Documentation/networking/netconsole.txt). > >ACK, will try that now for you. > >>You may want to try setting the sysctl kernel.sysrq=1 , then when the >>system has locked up, press ALT+PRINT+L (to generate stack traces for >>all active CPUs from NMI context), and maybe also ALT+PRINT+T and >>ALT+PRINT+W (to collect more information about active tasks). > >Nod. > >>(If you share stack traces from these things with us, it would be >>helpful if you could run them through scripts/decode_stacktrace.pl >>from the kernel tree first, to add line number information.) > >ACK. Output passed through scripts/decode_stacktrace.sh attached. Just about to try John's suggestion next. -- Steve McIntyre, Cambridge, UK.st...@einval.com Dance like no one's watching. Encrypt like everyone is. - @torproject tack.log.decoded.gz Description: application/gzip
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
On 6/26/20 9:50 AM, Steve McIntyre wrote: > Hi Jann, > > On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: >> On Fri, Jun 26, 2020 at 3:41 PM Greg KH wrote: >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > > ... > Considering I'm running strace build tests to provoke this bug, finding the failure in a commit talking about ptrace changes does look very suspicious...! Annoyingly, I can't reproduce this on my disparate other machines here, suggesting it's maybe(?) timing related. >> >> Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure >> prints a warning to dmesg? If so, can you share that warning? > > I mean the machine locks hard - X stops updating, the mouse/keyboard > stop responding. No pings, etc. When I reboot, there's nothing in the > logs. > >> If you don't have any way to see console output, and you don't have a >> working serial console setup or such, you may want to try re-running >> those tests while the kernel is booted with netconsole enabled to log >> to a different machine over UDP (see >> https://www.kernel.org/doc/Documentation/networking/netconsole.txt). > > ACK, will try that now for you. > >> You may want to try setting the sysctl kernel.sysrq=1 , then when the >> system has locked up, press ALT+PRINT+L (to generate stack traces for >> all active CPUs from NMI context), and maybe also ALT+PRINT+T and >> ALT+PRINT+W (to collect more information about active tasks). > > Nod. > >> (If you share stack traces from these things with us, it would be >> helpful if you could run them through scripts/decode_stacktrace.pl >>from the kernel tree first, to add line number information.) > > ACK. > >> Trying to isolate the problem: >> >> __end_current_label_crit_section and end_current_label_crit_section >> are aliases of each other (via #define), so that line change can't >> have done anything. >> >> That leaves two possibilities AFAICS: >> - the might_sleep() call by itself is causing issues for one of the >> remaining users of begin_current_label_crit_section() (because it >> causes preemption to happen more often where it didn't happen on >> PREEMPT_VOLUNTARY before, or because it's trying to print a warning >> message with the runqueue lock held, or something like that) >> - the lack of "if (aa_replace_current_label(label) == 0) >> aa_put_label(label);" in __begin_current_label_crit_section() is >> somehow causing issues >> >> You could try to see whether just adding the might_sleep(), or just >> replacing the begin_current_label_crit_section() call with >> __begin_current_label_crit_section(), triggers the bug. >> >> >> If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if >> that isn't already set in your kernel config -, that might help track >> down the problem, unless it magically makes the problem stop >> triggering (which I guess would be conceivable if this indeed is a >> race). > > OK, will try that second... > I have not been able to reproduce but So looking at linux-4.19.y it looks like 1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check was picked without ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check Both of them are marked as Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") so I would expect them to be picked together. ptraceme is potentially updating the task's cred while the access check is running. Try building after picking ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
Hi Jann, On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote: >On Fri, Jun 26, 2020 at 3:41 PM Greg KH wrote: >> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: ... >> > Considering I'm running strace build tests to provoke this bug, >> > finding the failure in a commit talking about ptrace changes does look >> > very suspicious...! >> > >> > Annoyingly, I can't reproduce this on my disparate other machines >> > here, suggesting it's maybe(?) timing related. > >Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure >prints a warning to dmesg? If so, can you share that warning? I mean the machine locks hard - X stops updating, the mouse/keyboard stop responding. No pings, etc. When I reboot, there's nothing in the logs. >If you don't have any way to see console output, and you don't have a >working serial console setup or such, you may want to try re-running >those tests while the kernel is booted with netconsole enabled to log >to a different machine over UDP (see >https://www.kernel.org/doc/Documentation/networking/netconsole.txt). ACK, will try that now for you. >You may want to try setting the sysctl kernel.sysrq=1 , then when the >system has locked up, press ALT+PRINT+L (to generate stack traces for >all active CPUs from NMI context), and maybe also ALT+PRINT+T and >ALT+PRINT+W (to collect more information about active tasks). Nod. >(If you share stack traces from these things with us, it would be >helpful if you could run them through scripts/decode_stacktrace.pl >from the kernel tree first, to add line number information.) ACK. >Trying to isolate the problem: > >__end_current_label_crit_section and end_current_label_crit_section >are aliases of each other (via #define), so that line change can't >have done anything. > >That leaves two possibilities AFAICS: > - the might_sleep() call by itself is causing issues for one of the >remaining users of begin_current_label_crit_section() (because it >causes preemption to happen more often where it didn't happen on >PREEMPT_VOLUNTARY before, or because it's trying to print a warning >message with the runqueue lock held, or something like that) > - the lack of "if (aa_replace_current_label(label) == 0) >aa_put_label(label);" in __begin_current_label_crit_section() is >somehow causing issues > >You could try to see whether just adding the might_sleep(), or just >replacing the begin_current_label_crit_section() call with >__begin_current_label_crit_section(), triggers the bug. > > >If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if >that isn't already set in your kernel config -, that might help track >down the problem, unless it magically makes the problem stop >triggering (which I guess would be conceivable if this indeed is a >race). OK, will try that second... -- Steve McIntyre, Cambridge, UK.st...@einval.com "I can't ever sleep on planes ... call it irrational if you like, but I'm afraid I'll miss my stop" -- Vivek Das Mohapatra
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
On Fri, Jun 26, 2020 at 02:45:18PM +0100, Steve McIntyre wrote: >Hey Greg, > >On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote: >>On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: >>> >>> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit >>> commit e58f543fc7c0926f31a49619c1a3648e49e8d233 >>> Author: Jann Horn >>> Date: Thu Sep 13 18:12:09 2018 +0200 > >... > >>> Annoyingly, I can't reproduce this on my disparate other machines >>> here, suggesting it's maybe(?) timing related. >>> >>> Hope this helps - happy to give more information, test things, etc. >> >>So if you just revert this one patch, all works well? > >Correct. > >>I've added the authors of the patch to the cc: list... >> >>Also, does this problem happen on Linus's tree? > >Just building to check that now... Current head here (3e08a95294a4fb3702bb3d35ed08028433c37fe6) works fine. -- Steve McIntyre, Cambridge, UK.st...@einval.com The two hard things in computing: * naming things * cache invalidation * off-by-one errors -- Stig Sandbeck Mathisen
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
Hey Greg, On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote: >On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: >> >> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit >> commit e58f543fc7c0926f31a49619c1a3648e49e8d233 >> Author: Jann Horn >> Date: Thu Sep 13 18:12:09 2018 +0200 ... >> Annoyingly, I can't reproduce this on my disparate other machines >> here, suggesting it's maybe(?) timing related. >> >> Hope this helps - happy to give more information, test things, etc. > >So if you just revert this one patch, all works well? Correct. >I've added the authors of the patch to the cc: list... > >Also, does this problem happen on Linus's tree? Just building to check that now... -- Steve McIntyre, Cambridge, UK.st...@einval.com < Aardvark> I dislike C++ to start with. C++11 just seems to be handing rope-creating factories for users to hang multiple instances of themselves.
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote: > Hi folks, > > I'm the maintainer in Debian for strace. Trying to reproduce > https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've > found a repeatable hard lockup running the strace testsuite. Each time > it seems to have failed in a slightly different place in the testsuite > (suggesting it's not one particular syscall test that's triggering the > failure). I initially found this using Debian's current Buster kernel > (4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1 > worked fine. > > I've bisected to find the failure point along the linux-4.19.y stable > branch and what I've got to is the following commit: > > e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit > commit e58f543fc7c0926f31a49619c1a3648e49e8d233 > Author: Jann Horn > Date: Thu Sep 13 18:12:09 2018 +0200 > > apparmor: don't try to replace stale label in ptrace access check > > [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ] > > As a comment above begin_current_label_crit_section() explains, > begin_current_label_crit_section() must run in sleepable context because > when label_is_stale() is true, aa_replace_current_label() runs, which uses > prepare_creds(), which can sleep. > Until now, the ptrace access check (which runs with a task lock held) > violated this rule. > > Also add a might_sleep() assertion to begin_current_label_crit_section(), > because asserts are less likely to be ignored than comments. > > Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") > Signed-off-by: Jann Horn > Signed-off-by: John Johansen > Signed-off-by: Sasha Levin > > :04 04 ca92f885a38c1747b812116f19de6967084a647e > 865a227665e460e159502f21e8a16e6fa590bf50 M security > > Considering I'm running strace build tests to provoke this bug, > finding the failure in a commit talking about ptrace changes does look > very suspicious...! > > Annoyingly, I can't reproduce this on my disparate other machines > here, suggesting it's maybe(?) timing related. > > Hope this helps - happy to give more information, test things, etc. So if you just revert this one patch, all works well? I've added the authors of the patch to the cc: list... Also, does this problem happen on Linus's tree? thanks, greg k-h
Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
Hi folks, I'm the maintainer in Debian for strace. Trying to reproduce https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've found a repeatable hard lockup running the strace testsuite. Each time it seems to have failed in a slightly different place in the testsuite (suggesting it's not one particular syscall test that's triggering the failure). I initially found this using Debian's current Buster kernel (4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1 worked fine. I've bisected to find the failure point along the linux-4.19.y stable branch and what I've got to is the following commit: e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit commit e58f543fc7c0926f31a49619c1a3648e49e8d233 Author: Jann Horn Date: Thu Sep 13 18:12:09 2018 +0200 apparmor: don't try to replace stale label in ptrace access check [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ] As a comment above begin_current_label_crit_section() explains, begin_current_label_crit_section() must run in sleepable context because when label_is_stale() is true, aa_replace_current_label() runs, which uses prepare_creds(), which can sleep. Until now, the ptrace access check (which runs with a task lock held) violated this rule. Also add a might_sleep() assertion to begin_current_label_crit_section(), because asserts are less likely to be ignored than comments. Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels") Signed-off-by: Jann Horn Signed-off-by: John Johansen Signed-off-by: Sasha Levin :04 04 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security Considering I'm running strace build tests to provoke this bug, finding the failure in a commit talking about ptrace changes does look very suspicious...! Annoyingly, I can't reproduce this on my disparate other machines here, suggesting it's maybe(?) timing related. Hope this helps - happy to give more information, test things, etc. -- Steve McIntyre, Cambridge, UK.st...@einval.com "Managing a volunteer open source project is a lot like herding kittens, except the kittens randomly appear and disappear because they have day jobs." -- Matt Mackall