Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Sasha Levin

On Fri, Jun 26, 2020 at 08:01:18PM +0100, Steve McIntyre wrote:
>On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote:
>>On 6/26/20 9:50 AM, Steve McIntyre wrote:
>>>
>>> OK, will try that second...
>>>
>>
>>I have not been able to reproduce but
>>
>>So looking at linux-4.19.y it looks like
>>1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check
>>
>>was picked without
>>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
>>
>>Both of them are marked as
>>Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
>>
>>so I would expect them to be picked together.
>>
>>ptraceme is potentially updating the task's cred while the access check is
>>running.
>>
>>Try building after picking
>>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
>
>Bingo! With that one change the test suite runs to completion, no lockup.

I've queued ca3fde5214e1 for 4.19 and 4.14, thank you for your work!

--
Thanks,
Sasha



Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Steve McIntyre
On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote:
>On 6/26/20 9:50 AM, Steve McIntyre wrote:
>> 
>> OK, will try that second...
>> 
>
>I have not been able to reproduce but
>
>So looking at linux-4.19.y it looks like
>1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check
>
>was picked without
>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
>
>Both of them are marked as
>Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
>
>so I would expect them to be picked together.
>
>ptraceme is potentially updating the task's cred while the access check is
>running.
>
>Try building after picking
>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check

Bingo! With that one change the test suite runs to completion, no lockup.

\o/

Thanks guys, I think we've found the cause here.

-- 
Steve McIntyre, Cambridge, UK.        st...@einval.com
"The whole problem with the world is that fools and fanatics are
 always so certain of themselves, and wiser people so full of doubts."
   -- Bertrand Russell



Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Steve McIntyre
Hi again,

On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote:
>On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
>>On Fri, Jun 26, 2020 at 3:41 PM Greg KH  wrote:
>>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
>
>...
>
>>> > Considering I'm running strace build tests to provoke this bug,
>>> > finding the failure in a commit talking about ptrace changes does look
>>> > very suspicious...!
>>> >
>>> > Annoyingly, I can't reproduce this on my disparate other machines
>>> > here, suggesting it's maybe(?) timing related.
>>
>>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
>>prints a warning to dmesg? If so, can you share that warning?
>
>I mean the machine locks hard - X stops updating, the mouse/keyboard
>stop responding. No pings, etc. When I reboot, there's nothing in the
>logs.
>
>>If you don't have any way to see console output, and you don't have a
>>working serial console setup or such, you may want to try re-running
>>those tests while the kernel is booted with netconsole enabled to log
>>to a different machine over UDP (see
>>https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
>
>ACK, will try that now for you.
>
>>You may want to try setting the sysctl kernel.sysrq=1 , then when the
>>system has locked up, press ALT+PRINT+L (to generate stack traces for
>>all active CPUs from NMI context), and maybe also ALT+PRINT+T and
>>ALT+PRINT+W (to collect more information about active tasks).
>
>Nod.
>
>>(If you share stack traces from these things with us, it would be
>helpful if you could run them through scripts/decode_stacktrace.sh
>>from the kernel tree first, to add line number information.)
>
>ACK.

Output passed through scripts/decode_stacktrace.sh attached.

Just about to try John's suggestion next.
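For reference, the decoding step above works roughly as follows (paths are illustrative; vmlinux must be the unstripped image matching the kernel that produced the trace, and in the 4.19-era script the trace is read from stdin):

```shell
# From the top of the kernel tree used for the build (location is an
# assumption). decode_stacktrace.sh annotates raw stack traces with
# source file and line information using the matching vmlinux.
cd ~/src/linux-4.19.y
./scripts/decode_stacktrace.sh vmlinux "$(pwd)" < stack.log > stack.log.decoded
gzip stack.log.decoded    # compress for attaching to the bug report
```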

-- 
Steve McIntyre, Cambridge, UK.        st...@einval.com
Dance like no one's watching. Encrypt like everyone is.
 - @torproject


tack.log.decoded.gz
Description: application/gzip


Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread John Johansen
On 6/26/20 9:50 AM, Steve McIntyre wrote:
> Hi Jann,
> 
> On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
>> On Fri, Jun 26, 2020 at 3:41 PM Greg KH  wrote:
>>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> 
> ...
> 
>>> > Considering I'm running strace build tests to provoke this bug,
>>> > finding the failure in a commit talking about ptrace changes does look
>>> > very suspicious...!
>>> >
>>> > Annoyingly, I can't reproduce this on my disparate other machines
>>> > here, suggesting it's maybe(?) timing related.
>>
>> Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
>> prints a warning to dmesg? If so, can you share that warning?
> 
> I mean the machine locks hard - X stops updating, the mouse/keyboard
> stop responding. No pings, etc. When I reboot, there's nothing in the
> logs.
> 
>> If you don't have any way to see console output, and you don't have a
>> working serial console setup or such, you may want to try re-running
>> those tests while the kernel is booted with netconsole enabled to log
>> to a different machine over UDP (see
>> https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
> 
> ACK, will try that now for you.
> 
>> You may want to try setting the sysctl kernel.sysrq=1 , then when the
>> system has locked up, press ALT+PRINT+L (to generate stack traces for
>> all active CPUs from NMI context), and maybe also ALT+PRINT+T and
>> ALT+PRINT+W (to collect more information about active tasks).
> 
> Nod.
> 
>> (If you share stack traces from these things with us, it would be
>> helpful if you could run them through scripts/decode_stacktrace.sh
>> from the kernel tree first, to add line number information.)
> 
> ACK.
> 
>> Trying to isolate the problem:
>>
>> __end_current_label_crit_section and end_current_label_crit_section
>> are aliases of each other (via #define), so that line change can't
>> have done anything.
>>
>> That leaves two possibilities AFAICS:
>> - the might_sleep() call by itself is causing issues for one of the
>> remaining users of begin_current_label_crit_section() (because it
>> causes preemption to happen more often where it didn't happen on
>> PREEMPT_VOLUNTARY before, or because it's trying to print a warning
>> message with the runqueue lock held, or something like that)
>> - the lack of "if (aa_replace_current_label(label) == 0)
>> aa_put_label(label);" in __begin_current_label_crit_section() is
>> somehow causing issues
>>
>> You could try to see whether just adding the might_sleep(), or just
>> replacing the begin_current_label_crit_section() call with
>> __begin_current_label_crit_section(), triggers the bug.
>>
>>
>> If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if
>> that isn't already set in your kernel config -, that might help track
>> down the problem, unless it magically makes the problem stop
>> triggering (which I guess would be conceivable if this indeed is a
>> race).
> 
> OK, will try that second...
> 

I have not been able to reproduce but

So looking at linux-4.19.y it looks like
1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check

was picked without
ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check

Both of them are marked as
Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")

so I would expect them to be picked together.


ptraceme is potentially updating the task's cred while the access check is
running.

Try building after picking
ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
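For anyone following along, picking that commit onto a stable checkout looks roughly like this (remote and branch names are assumptions; since ca3fde5214e1 is a mainline commit id, Linus's tree must be reachable as a remote so the object exists locally):

```shell
# Check out the stable branch under test and apply the missing fix.
git fetch stable linux-4.19.y
git fetch torvalds                       # mainline remote, so the commit object is present
git checkout -b test-ptraceme stable/linux-4.19.y
git cherry-pick ca3fde5214e1             # apparmor: don't try to replace stale label in ptraceme check

# Rebuild with the existing config.
make olddefconfig
make -j"$(nproc)"
```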



Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Steve McIntyre
Hi Jann,

On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
>On Fri, Jun 26, 2020 at 3:41 PM Greg KH  wrote:
>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:

...

>> > Considering I'm running strace build tests to provoke this bug,
>> > finding the failure in a commit talking about ptrace changes does look
>> > very suspicious...!
>> >
>> > Annoyingly, I can't reproduce this on my disparate other machines
>> > here, suggesting it's maybe(?) timing related.
>
>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
>prints a warning to dmesg? If so, can you share that warning?

I mean the machine locks hard - X stops updating, the mouse/keyboard
stop responding. No pings, etc. When I reboot, there's nothing in the
logs.

>If you don't have any way to see console output, and you don't have a
>working serial console setup or such, you may want to try re-running
>those tests while the kernel is booted with netconsole enabled to log
>to a different machine over UDP (see
>https://www.kernel.org/doc/Documentation/networking/netconsole.txt).

ACK, will try that now for you.
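For reference, a minimal netconsole setup along the lines of the doc Jann links (all addresses, ports, and interface names below are placeholders, not details from this thread):

```shell
# On the machine under test: stream kernel messages over UDP to a log
# host. Format: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac].
# The same string can go on the kernel command line as netconsole=...
# so logging is active from early boot.
modprobe netconsole \
    netconsole=6665@192.168.1.10/eth0,6666@192.168.1.11/aa:bb:cc:dd:ee:ff

# On the receiving machine (OpenBSD netcat syntax): capture everything,
# since a hard lockup means the sender's own logs won't survive.
nc -l -u 6666 | tee netconsole.log
```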

>You may want to try setting the sysctl kernel.sysrq=1 , then when the
>system has locked up, press ALT+PRINT+L (to generate stack traces for
>all active CPUs from NMI context), and maybe also ALT+PRINT+T and
>ALT+PRINT+W (to collect more information about active tasks).

Nod.
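For the record, the same SysRq dumps can be triggered without the keyboard (e.g. from a second terminal before the machine wedges completely); everything below assumes root:

```shell
sysctl kernel.sysrq=1           # allow all SysRq functions
echo l > /proc/sysrq-trigger    # backtraces of active CPUs (from NMI context)
echo t > /proc/sysrq-trigger    # state dump of all tasks
echo w > /proc/sysrq-trigger    # tasks in uninterruptible (blocked) state
dmesg | tail -n 200             # output lands in the kernel log
```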

>(If you share stack traces from these things with us, it would be
>helpful if you could run them through scripts/decode_stacktrace.sh
>from the kernel tree first, to add line number information.)

ACK.

>Trying to isolate the problem:
>
>__end_current_label_crit_section and end_current_label_crit_section
>are aliases of each other (via #define), so that line change can't
>have done anything.
>
>That leaves two possibilities AFAICS:
> - the might_sleep() call by itself is causing issues for one of the
>remaining users of begin_current_label_crit_section() (because it
>causes preemption to happen more often where it didn't happen on
>PREEMPT_VOLUNTARY before, or because it's trying to print a warning
>message with the runqueue lock held, or something like that)
> - the lack of "if (aa_replace_current_label(label) == 0)
>aa_put_label(label);" in __begin_current_label_crit_section() is
>somehow causing issues
>
>You could try to see whether just adding the might_sleep(), or just
>replacing the begin_current_label_crit_section() call with
>__begin_current_label_crit_section(), triggers the bug.
>
>
>If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if
>that isn't already set in your kernel config -, that might help track
>down the problem, unless it magically makes the problem stop
>triggering (which I guess would be conceivable if this indeed is a
>race).

OK, will try that second...
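Enabling the debug option Jann suggests is quickest with the config helper that ships in the kernel tree (run from the top of the source tree, with an existing .config):

```shell
# CONFIG_DEBUG_ATOMIC_SLEEP makes might_sleep() warn loudly when called
# from atomic context, which is exactly the class of bug suspected here.
./scripts/config --enable DEBUG_ATOMIC_SLEEP
make olddefconfig                  # fold the change into .config consistently
grep DEBUG_ATOMIC_SLEEP .config    # expect CONFIG_DEBUG_ATOMIC_SLEEP=y
```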

-- 
Steve McIntyre, Cambridge, UK.        st...@einval.com
"I can't ever sleep on planes ... call it irrational if you like, but I'm
 afraid I'll miss my stop" -- Vivek Das Mohapatra



Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Steve McIntyre
On Fri, Jun 26, 2020 at 02:45:18PM +0100, Steve McIntyre wrote:
>Hey Greg,
>
>On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote:
>>On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
>>> 
>>> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
>>> commit e58f543fc7c0926f31a49619c1a3648e49e8d233
>>> Author: Jann Horn 
>>> Date:   Thu Sep 13 18:12:09 2018 +0200
>
>...
>
>>> Annoyingly, I can't reproduce this on my disparate other machines
>>> here, suggesting it's maybe(?) timing related.
>>> 
>>> Hope this helps - happy to give more information, test things, etc.
>>
>>So if you just revert this one patch, all works well?
>
>Correct.
>
>>I've added the authors of the patch to the cc: list...
>>
>>Also, does this problem happen on Linus's tree?
>
>Just building to check that now...

Current head here (3e08a95294a4fb3702bb3d35ed08028433c37fe6) works fine.

-- 
Steve McIntyre, Cambridge, UK.        st...@einval.com
The two hard things in computing:
 * naming things
 * cache invalidation
 * off-by-one errors  -- Stig Sandbeck Mathisen



Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Steve McIntyre
Hey Greg,

On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote:
>On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
>> 
>> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
>> commit e58f543fc7c0926f31a49619c1a3648e49e8d233
>> Author: Jann Horn 
>> Date:   Thu Sep 13 18:12:09 2018 +0200

...

>> Annoyingly, I can't reproduce this on my disparate other machines
>> here, suggesting it's maybe(?) timing related.
>> 
>> Hope this helps - happy to give more information, test things, etc.
>
>So if you just revert this one patch, all works well?

Correct.

>I've added the authors of the patch to the cc: list...
>
>Also, does this problem happen on Linus's tree?

Just building to check that now...

-- 
Steve McIntyre, Cambridge, UK.        st...@einval.com
< Aardvark> I dislike C++ to start with. C++11 just seems to be
handing rope-creating factories for users to hang multiple
instances of themselves.



Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Greg KH
On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> Hi folks,
> 
> I'm the maintainer in Debian for strace. Trying to reproduce
> https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've
> found a repeatable hard lockup running the strace testsuite. Each time
> it seems to have failed in a slightly different place in the testsuite
> (suggesting it's not one particular syscall test that's triggering the
> failure). I initially found this using Debian's current Buster kernel
> (4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1
> worked fine.
> 
> I've bisected to find the failure point along the linux-4.19.y stable
> branch and what I've got to is the following commit:
> 
> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
> commit e58f543fc7c0926f31a49619c1a3648e49e8d233
> Author: Jann Horn 
> Date:   Thu Sep 13 18:12:09 2018 +0200
> 
> apparmor: don't try to replace stale label in ptrace access check
> 
> [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ]
> 
> As a comment above begin_current_label_crit_section() explains,
> begin_current_label_crit_section() must run in sleepable context because
> when label_is_stale() is true, aa_replace_current_label() runs, which uses
> prepare_creds(), which can sleep.
> Until now, the ptrace access check (which runs with a task lock held)
> violated this rule.
> 
> Also add a might_sleep() assertion to begin_current_label_crit_section(),
> because asserts are less likely to be ignored than comments.
> 
> Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
> Signed-off-by: Jann Horn 
> Signed-off-by: John Johansen 
> Signed-off-by: Sasha Levin 
> 
> :040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security
> 
> Considering I'm running strace build tests to provoke this bug,
> finding the failure in a commit talking about ptrace changes does look
> very suspicious...!
> 
> Annoyingly, I can't reproduce this on my disparate other machines
> here, suggesting it's maybe(?) timing related.
> 
> Hope this helps - happy to give more information, test things, etc.

So if you just revert this one patch, all works well?

I've added the authors of the patch to the cc: list...

Also, does this problem happen on Linus's tree?

thanks,

greg k-h



Bug#963493: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards

2020-06-26 Thread Steve McIntyre
Hi folks,

I'm the maintainer in Debian for strace. Trying to reproduce
https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've
found a repeatable hard lockup running the strace testsuite. Each time
it seems to have failed in a slightly different place in the testsuite
(suggesting it's not one particular syscall test that's triggering the
failure). I initially found this using Debian's current Buster kernel
(4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1
worked fine.

I've bisected to find the failure point along the linux-4.19.y stable
branch and what I've got to is the following commit:

e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
commit e58f543fc7c0926f31a49619c1a3648e49e8d233
Author: Jann Horn 
Date:   Thu Sep 13 18:12:09 2018 +0200

apparmor: don't try to replace stale label in ptrace access check

[ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ]

As a comment above begin_current_label_crit_section() explains,
begin_current_label_crit_section() must run in sleepable context because
when label_is_stale() is true, aa_replace_current_label() runs, which uses
prepare_creds(), which can sleep.
Until now, the ptrace access check (which runs with a task lock held)
violated this rule.

Also add a might_sleep() assertion to begin_current_label_crit_section(),
because asserts are less likely to be ignored than comments.

Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
Signed-off-by: Jann Horn 
Signed-off-by: John Johansen 
Signed-off-by: Sasha Levin 

:040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security

Considering I'm running strace build tests to provoke this bug,
finding the failure in a commit talking about ptrace changes does look
very suspicious...!

Annoyingly, I can't reproduce this on my disparate other machines
here, suggesting it's maybe(?) timing related.

Hope this helps - happy to give more information, test things, etc.

-- 
Steve McIntyre, Cambridge, UK.        st...@einval.com
"Managing a volunteer open source project is a lot like herding
 kittens, except the kittens randomly appear and disappear because they
 have day jobs." -- Matt Mackall