Hi,

On Tue, Dec 19, 2023 at 12:03:07PM +0000, Richard Purdie wrote:
> On Mon, 2023-12-18 at 23:01 +0000, Richard Purdie via
> lists.openembedded.org wrote:
> > On Mon, 2023-12-18 at 10:07 -0800, Khem Raj wrote:
> > > On Mon, Dec 18, 2023 at 9:58 AM Richard Purdie
> > > <[email protected]> wrote:
> > > > 
> > > > On Mon, 2023-12-18 at 09:45 -0800, Khem Raj wrote:
> > > > > I tried the two patches in this series. It did improve the situation
> > > > > but I am still getting SSH timeouts. But this time its 13 tests
> > > > > earlier it used to be 40+
> > > > > btw. my images are using systemd. So it might be good to see if we see
> > > > > this with poky-altcfg as well or not.
> > > > 
> > > > Do you have the log.do_testimage and the ${WORKDIR}/testimage/ files?
> > > 
> > > yes, further I ran the failing tests in loop one after another still
> > > one test gzip fails with ssh timeouts
> > > 
> > > https://busybox.net/~kraj/log.do_testimage.503
> > > https://busybox.net/~kraj/testimage/
> > > 
> > > there are two runs in the testimages folder. In one you see the RCU
> > > stall and in second you do not
> > > but it fails with same ssh timeout issue.
> > > 
> > > > 
> > > > Did you still see rcu stalls in the logs?
> > 
> > What is interesting is there is ~3MB of nulls in the .2 serial log. The
> > rcu stall is also:
> > 
> > [   88.261687]  serial8250_tx_chars+0xea/0x2b0
> > [   88.261689]  serial8250_handle_irq+0x1e9/0x330
> > [   88.261691]  serial8250_default_handle_irq+0x4a/0x90
> > [   88.261693]  serial8250_interrupt+0x66/0xc0
> > [   88.261696]  __handle_irq_event_percpu+0x54/0x1c0
> > [   88.261701]  handle_irq_event+0x3d/0x80
> > 
> > i.e. it is stalled in the serial TX path.
> > 
> > The big question is why is there so many nulls on the serial port. I
> > see a few on my local x86 test runs but only ~4kb, not megabytes of
> > them. I hadn't worked out where/what they are from yet.
> > 
> > I suspect something in the serial/kernel/qemu space isn't interacting
> > correctly.
> > 
> > Find the cause of the nulls and we might make progress on this.
> 
> I went digging and it is a mess. That 3MB file is actually a wtmp file
> from /var/log. The reason is a failure of the "target_dumper" commands
> and that the simplistic serial command handling doesn't like binary
> files.
> 
> I've a few patches which should improve the situation but we're clearly
> multiple issues at play here where issues are stacking on top of other
> issues. My rough list of issues:
> 
> a) stdout logging from qemu wasn't being read and could overflow
> buffers locking qemu
> b) serial data from commands wasn't being fully read and could overflow
> buffers locking qemu
> c) the dumper log reading command is reading binary files like wtmp
> d) the dumper logs overwrite each other so are useless anyway
> e) the dumper logs run on every failed command
> f) qmp monitor dump command also run on every failed command
> 
> I think we'd be much better off if we drop the dumper/monitor stuff and
> qemu runs might be quite a bit faster too.

I agree to this. If target has serious issues like a full lockup, then dumper 
and
monitor commands don't help anyway. Capturing the full serial logs would be 
sufficient,
though I'd like to see a live capture of /var/log/message and systemd journal 
in case
services started by init have major issues.

Cheers,

-Mikko
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#192725): 
https://lists.openembedded.org/g/openembedded-core/message/192725
Mute This Topic: https://lists.openembedded.org/mt/103244414/21656
Group Owner: [email protected]
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub 
[[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-

Reply via email to