On Mon, 2023-10-09 at 13:05 +0300, Mikko Rapeli wrote:
> Yocto Linux distro maintainer Richard Purdie is seeing a regression or 
> behavior
> change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated 
> tests with qemu
> where a python test framework configures and spawns qemu (version 8.1) with 
> two
> serial ports and boots a newly built kernel and rootfs there. The python test 
> framework
> tries to detect getty login prompt from ttyS1 while boot time logging
> from qemu, kernel, userspace etc happens on serial console ttyS0. This has 
> worked
> reliably for years and across multiple kernel versions. Now with 6.5 kernel 
> there are
> suddenly frequent failures where ttyS1 doesn't show the login prompt in the 
> python code
> which listens on the qemu socket from target system ttyS1. The target system 
> ttyS0 is working
> and boot time logs and getty prompt are visible there. The ttyS1 logs from 
> test framework show
> that only two characters CR and LF make it through. This same happens on 
> x86_64 host
> running qemu machine for x86_64 and arm64 targets.
> 
> The kernel boot logs from target system with 6.5.5 don't show any errors and 
> getty
> processes are started correctly and they don't see any errors either. Data 
> from
> target system userspace is being written, but seems to sometimes hang inside 
> kernel
> or qemu serial port buffers, and only on the ttyS1 while ttyS0 works reliably 
> all
> the time.
> 
> Example target logs with 6.5.5 via ttyS0:
> 
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853
> or https://pastebin.com/raw/jRRa2CwW
> 
> ttyS1 logs from the same run:
> https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2
> 
> Kernel config:
> https://pastebin.com/raw/Lv9Hfeuh
> 
> Do you have any ideas or hints how to debug or possibly fix this?
> 
> tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and 
> we're not able
> to pinpoint or bisect the issue, since it so annoyingly rare.
> 
> Some more details from Richard below.

Brief summary:

We're seeing an issue on x86_64 with 6.5.X where data appears to be
left in the transmission buffer and not sent to the port on the second
serial port (ttyS1) until we trigger it with intervention.

Paul Gortmaker did some painful bisection over a few days down to:

serial: core: Start managing serial controllers to enable runtime PM
https://lore.kernel.org/linux-serial/[email protected]/T/#t

More details:

We run very simple plain images under qemu with two serial ports
configured. These are x86_64 images with two 16550A ports with the
standard x86 port addresses which appear as ttyS0 and ttyS1, nothing
special. We run a console and getty on ttyS0 and a getty on ttyS1.
After boot, we wait for a getty to appear on ttyS1, login to it and run
commands.

With 6.5.5 (and latest 6.5.X head as of yesterday), we see an issue say
1% of the time where the getty never appears on ttyS1 even after our
timeout of 1000s.

When this happens we've added code to login to the ttyS0 getty and run
debug commands. We've been able to confirm the getty is running and the
init system doesn't matter (happens with sysvinit and systemd). The
most interesting debug I've seen is this:

root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3
root@qemux86-64:~# echo helloA > /dev/ttyS1
root@qemux86-64:~# echo helloB > /dev/ttyS0
helloB
root@qemux86-64:~# cat /proc/tty/driver/serial
serinfo:1.0 driver revision:
0: uart:16550A port:000003F8 irq:4 tx:803 rx:121 RTS|CTS|DTR|DSR|CD
1: uart:16550A port:000002F8 irq:3 tx:281 rx:0 RTS|CTS|DTR|DSR|CD
2: uart:unknown port:000003E8 irq:4
3: uart:unknown port:000002E8 irq:3

This is being run after the getty didn't appear for 60s on ttyS1 so
we've logged into ttyS0 and run these commands. We've seen that if it
doesn't appear after 60s, it won't appear after 1000s either.

The tx:249 is interesting as it should be tx:273, 273 being the number
of bytes our successful serial getty prompt has. Once we echo something
to the port (8 bytes), tx: jumps to 281, so it suddenly found our
missing login prompt. This is confirmed with the data appearing on the
port after the echo.

I did try disabling the autosuspend code in the commit above but it
made no difference.

Reproducing this is a pain. I thought I had a local setup doing it
fairly frequently, then spent an hour this morning without it happening
once.

I'm hoping someone might recognise what is going on here? I'm happy to
try and add debug or find further information if that would help narrow
it down.

Cheers,

Richard

-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#189112): 
https://lists.openembedded.org/g/openembedded-core/message/189112
Mute This Topic: https://lists.openembedded.org/mt/101849211/21656
Group Owner: [email protected]
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to