On Fri, 24 Apr 2020 10:39:26 +0200 Stephen Berman <[email protected]> wrote:

> On Sat, 18 Apr 2020 11:22:36 -0500 Bruce Dubbs <[email protected]> wrote:
>
>> On 4/18/20 6:30 AM, Stephen Berman wrote:
>>> On Sat, 18 Apr 2020 03:08:46 -0400 Michael Shell <[email protected]> wrote:
>>>
>>>> On Wed, 15 Apr 2020 10:41:27 +0200
>>>> Stephen Berman <[email protected]> wrote:
>>
>>>> Another thing to try - does the problem persist if the CDROM is
>>>> physically disconnected from the system (e.g., its SATA cable
>>>> disconnected and then the system powered up)? If the system
>>>> just hangs at some other point, then that is good evidence that
>>>> something other than the CDROM driver/SCSI system is to blame.
>>> Thanks for the feedback.  I'll see if I can try your suggestions and see
>>> what happens.
>>
>> Another suggestion would be to build a problematic kernel with
>> CONFIG_CDROM=n to see if the issue is still present.
>
> I guess you mean building *without* CONFIG_CDROM=y, since CONFIG_CDROM=n
> doesn't seem to be a valid Kconfig setting.  In order to remove
> CONFIG_CDROM=y I had to unset CONFIG_BLK_DEV_SR (I couldn't see how to
> do this with `make menuconfig' with the existing .config loaded, but I
> could do it with `make xconfig').  Anyway, I did that with the
> problematic kernel 5.3.0, rebuilt and reinstalled the kernel, booted it,
> confirmed that there was no /dev/cdrom, ran startx, emacs, firefox,
> exited these, ran `shutdown -h now', and as previously with this kernel,
> after the loopback interface message there was nothing more, and after
> waiting more than two minutes I pressed the restart button.
>
> So it looks like detaching the cdrom is not the problem after all.
>
> On Wed, 22 Apr 2020 01:43:18 -0400 Michael Shell <[email protected]> wrote:
>
>> If that confirms it is the cdrom, then you have to bisect until
>> you find the specific change in the driver that created the
>> issue.
>
> Although the cdrom is evidently not the problem, I guess I've reached
> the point where bisecting is the next thing to try.

I've completed the bisection of the mainline kernel between the good
v5.1 and the bad v5.2, and here's the result:

6d25be5782e482eb93e3de0c94d0a517879377d0 is the first bad commit
commit 6d25be5782e482eb93e3de0c94d0a517879377d0
Author: Thomas Gleixner <[email protected]>
Date:   Wed Mar 13 17:55:48 2019 +0100

    sched/core, workqueues: Distangle worker accounting from rq lock

    The worker accounting for CPU bound workers is plugged into the core
    scheduler code and the wakeup code. This is not a hard requirement and
    can be avoided by keeping track of the state in the workqueue code
    itself.

    Keep track of the sleeping state in the worker itself and call the
    notifier before entering the core scheduler. There might be false
    positives when the task is woken between that call and actually
    scheduling, but that's not really different from scheduling and being
    woken immediately after switching away. When nr_running is updated when
    the task is retunrning from schedule() then it is later compared when it
    is done from ttwu().

    [ bigeasy: preempt_disable() around wq_worker_sleeping() by Daniel Bristot de Oliveira ]

    Signed-off-by: Thomas Gleixner <[email protected]>
    Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Acked-by: Tejun Heo <[email protected]>
    Cc: Daniel Bristot de Oliveira <[email protected]>
    Cc: Lai Jiangshan <[email protected]>
    Cc: Linus Torvalds <[email protected]>
    Cc: Peter Zijlstra <[email protected]>
    Link: http://lkml.kernel.org/r/ad2b29b5715f970bffc1a7026cabd6ff0b24076a.1532952814.git.bris...@redhat.com
    Signed-off-by: Ingo Molnar <[email protected]>

 kernel/sched/core.c         | 88 +++++++++++----------------------------------
 kernel/workqueue.c          | 54 +++++++++++++---------------
 kernel/workqueue_internal.h |  5 +--
 3 files changed, 48 insertions(+), 99 deletions(-)

I hope I did the bisection correctly.  According to git there are almost
7000 commits between the v5.1 and v5.2 kernels, resulting in a 13-step
bisection.  In each step I first ran `make oldconfig' using the 5.1.0
config and accepting the defaults of all new options, then built and
installed the kernel.  I tested by booting the kernel, then running
startx (openbox), emacs and firefox, exiting all of these and then
running `shutdown -h now' in the tty.  Here's a summary of the results:

1st test: power off 45 seconds after the message "Bringing down the
loopback interface"; I told git this is bad, though it was a shorter
hang than most others I've experienced.
2nd test: power off 1'56" after loopback message (bad).
3rd test: power off 2'03" after loopback message (bad).
4th test: power off 1'47" after loopback message (bad).
5th test: waited more than 3 minutes after loopback message, no power
off, pressed restart button (bad).
6th test: power off within 4 seconds after loopback message (good).
7th test: power off 1'28" after loopback message (bad).
8th test: power off within 4 seconds after loopback message (good).
9th test: power off 2'50" after loopback message (bad).
10th test: power off 2'06" after loopback message (bad).
11th test: power off within 4 seconds after loopback message (good).
12th test: power off 2'23" after loopback message (bad).
13th test: power off within 4 seconds after loopback message (good).
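
As a rough cross-check of the 13-step count and of the tally above, here
is a minimal Python sketch.  It assumes a linear history (the real
kernel history contains merges, so git's step count is only
approximately log2 of the number of commits).  The names
max_bisect_steps and find_first_bad, the round figure 7000 and the index
4321 are made up for illustration; the outcome string just transcribes
the 13 results listed above.

```python
def max_bisect_steps(n_commits):
    """Worst-case number of builds needed to isolate one commit out of
    n_commits candidates by halving (assumes a linear history)."""
    return (n_commits - 1).bit_length()


def find_first_bad(n_commits, is_bad):
    """Binary search for the first index where is_bad(index) is True.

    Assumes every commit before the first bad one is good and every
    commit from it onward is bad.  Returns (first_bad_index, tests_run).
    """
    lo, hi = 0, n_commits - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(mid):
            hi = mid        # first bad commit is mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return lo, tests


# "Almost 7000 commits" between v5.1 and v5.2; 7000 is a round stand-in.
N_COMMITS = 7000
STEPS = max_bisect_steps(N_COMMITS)

# Hypothetical position of the culprit, used only to exercise the search.
FOUND, TESTS = find_first_bad(N_COMMITS, lambda i: i >= 4321)

# The 13 results listed above, in order (B = bad, G = good).
OUTCOMES = "BBBBBGBGBBGBG"

if __name__ == "__main__":
    print(f"worst-case steps for {N_COMMITS} commits: {STEPS}")
    print(f"simulated search found index {FOUND} after {TESTS} tests")
    print(f"bad: {OUTCOMES.count('B')}, good: {OUTCOMES.count('G')}")
```

Under these assumptions the worst case for about 7000 candidates is 13
tests, which is consistent with the number of kernels I built.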

It's striking that in all but one test the machine powered off in less
than 3 minutes.  As I've reported here, with kernels 5.2.0, 5.3.0, 5.5.9
and 5.6.4 the hang has been longer, so I've considered it indefinite and
always ended up pressing the restart button.  The only exceptions were
several tests where I remained in the tty and either immediately ran
`shutdown -h' or, in one case, `more': then it powered off in ~20
seconds.  Maybe a later commit has made the problem worse.

I guess I should report this to the kernel bug list?  I searched that
list for power-off issues and found only one report after the above
commit, and it seems to be a different issue.  It is rather surprising
that my machine seems to be the only one with this problem.

Steve Berman
-- 
http://lists.linuxfromscratch.org/listinfo/lfs-support
FAQ: http://www.linuxfromscratch.org/blfs/faq.html
Unsubscribe: See the above information page