Hi, On Tue, Dec 19, 2023 at 12:03:07PM +0000, Richard Purdie wrote: > On Mon, 2023-12-18 at 23:01 +0000, Richard Purdie via > lists.openembedded.org wrote: > > On Mon, 2023-12-18 at 10:07 -0800, Khem Raj wrote: > > > On Mon, Dec 18, 2023 at 9:58 AM Richard Purdie > > > <[email protected]> wrote: > > > > > > > > On Mon, 2023-12-18 at 09:45 -0800, Khem Raj wrote: > > > > > I tried the two patches in this series. It did improve the situation > > > > > but I am still getting SSH timeouts. But this time its 13 tests > > > > > earlier it used to be 40+ > > > > > btw. my images are using systemd. So it might be good to see if we see > > > > > this with poky-altcfg as well or not. > > > > > > > > Do you have the log.do_testimage and the ${WORKDIR}/testimage/ files? > > > > > > yes, further I ran the failing tests in loop one after another still > > > one test gzip fails with ssh timeouts > > > > > > https://busybox.net/~kraj/log.do_testimage.503 > > > https://busybox.net/~kraj/testimage/ > > > > > > there are two runs in the testimages folder. In one you see the RCU > > > stall and in second you do not > > > but it fails with same ssh timeout issue. > > > > > > > > > > > Did you still see rcu stalls in the logs? > > > > What is interesting is there is ~3MB of nulls in the .2 serial log. The > > rcu stall is also: > > > > [ 88.261687] serial8250_tx_chars+0xea/0x2b0 > > [ 88.261689] serial8250_handle_irq+0x1e9/0x330 > > [ 88.261691] serial8250_default_handle_irq+0x4a/0x90 > > [ 88.261693] serial8250_interrupt+0x66/0xc0 > > [ 88.261696] __handle_irq_event_percpu+0x54/0x1c0 > > [ 88.261701] handle_irq_event+0x3d/0x80 > > > > i.e. it is stalled in the serial TX path. > > > > The big question is why is there so many nulls on the serial port. I > > see a few on my local x86 test runs but only ~4kb, not megabytes of > > them. I hadn't worked out where/what they are from yet. > > > > I suspect something in the serial/kernel/qemu space isn't interacting > > correctly. > > > > Find the cause of the nulls and we might make progress on this. > > I went digging and it is a mess. That 3MB file is actually a wtmp file > from /var/log. The reason is a failure of the "target_dumper" commands > and that the simplistic serial command handling doesn't like binary > files. > > I've a few patches which should improve the situation but we're clearly > multiple issues at play here where issues are stacking on top of other > issues. My rough list of issues: > > a) stdout logging from qemu wasn't being read and could overflow > buffers locking qemu > b) serial data from commands wasn't being fully read and could overflow > buffers locking qemu > c) the dumper log reading command is reading binary files like wtmp > d) the dumper logs overwrite each other so are useless anyway > e) the dumper logs run on every failed command > f) qmp monitor dump command also run on every failed command > > I think we'd be much better off if we drop the dumper/monitor stuff and > qemu runs might be quite a bit faster too.
I agree to this. If target has serious issues like a full lockup, then dumper and monitor commands don't help anyway. Capturing the full serial logs would be sufficient, though I'd like to see a live capture of /var/log/message and systemd journal in case services started by init have major issues. Cheers, -Mikko
-=-=-=-=-=-=-=-=-=-=-=- Links: You receive all messages sent to this group. View/Reply Online (#192725): https://lists.openembedded.org/g/openembedded-core/message/192725 Mute This Topic: https://lists.openembedded.org/mt/103244414/21656 Group Owner: [email protected] Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [[email protected]] -=-=-=-=-=-=-=-=-=-=-=-
