On Mon, 2021-03-01 at 08:43 +0000, [email protected] wrote:
> On Mon, Mar 01, 2021 at 08:37:46AM +0000, Richard Purdie wrote:
> > On Mon, 2021-03-01 at 07:59 +0000, [email protected] wrote:
> > > On Fri, Feb 26, 2021 at 05:47:52PM +0000, Richard Purdie wrote:
> > > 
> > > 
> > > Based on comment:
> > > 
> > >     # tests can be heavy on IO and if bitbake can't write out its caches, we see timeouts.
> > >     # call sync around the tests to ensure the IO queue doesn't get too large, taking any IO
> > >     # hit here rather than in bitbake shutdown.
> > >     if sync:
> > >         p = os.environ['PATH']
> > >         os.environ['PATH'] = "/usr/bin:/bin:/usr/sbin:/sbin:" + p
> > >         os.system("sync")
> > >         os.environ['PATH'] = p
> > > 
> > > this looks like a workaround for some other bug.
> > 
> > The wider issue is the randomly failing ptests and other issues with the automated
> > testing, particularly things running under qemu system mode. Those seemed partly to
> > be caused by huge IO queues building up. The idea of the sync calls was to keep the
> > UI backlog manageable.
> > 
> > In most cases we execute occasional commands so this should work/help. In the case of
> > the reproducible ptest, we execute a cmp on every generated package, which really didn't
> > work well, hence it makes sense to disable it there.
> > 
> > We don't see this issue with any other tests as far as I've observed. I really don't
> > want to make the runtime testing results worse; anyone who's looked at swat or attended
> > bug triage will know how much trouble we're having with the issues this code helps
> > mitigate :/.
> 
> Ok, understood. You need to keep the system and tests running.
> 
> But IO flushes can take an arbitrary amount of time, so it sounds like the timeouts
> are too short, or should not exist at all. Also, the systems may be doing too much
> parallel work, or are actually short on RAM, if slow IO to disk starts
> failing things.
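
The concern about unbounded flush times could be addressed by waiting on sync with a deadline. A minimal sketch, not what the oeqa code actually does; `bounded_sync` and its timeout value are hypothetical:

```python
import shutil
import subprocess

def bounded_sync(timeout_s: float = 60.0) -> bool:
    """Flush dirty pages to disk, but stop waiting after timeout_s seconds."""
    # Resolve the binary explicitly in case the environment's PATH omits
    # /usr/sbin:/sbin (the problem the quoted PATH-prepend works around).
    sync_bin = shutil.which("sync") or "/usr/bin/sync"
    try:
        subprocess.run([sync_bin], timeout=timeout_s, check=True)
        return True
    except subprocess.TimeoutExpired:
        # The kernel keeps flushing in the background; we just stop waiting,
        # so the test harness isn't stalled indefinitely by a huge IO queue.
        return False
```

Note the caveat that a sync process stuck in uninterruptible sleep cannot actually be killed mid-flush, so this only bounds how long the caller waits, not the flush itself.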

I'm far from happy with where we're at with the autobuilder intermittent failures.
Along with the bug triage team, I've spent the best part of a couple of years trying
to get to grips with them. The sync workaround does help to a degree, which is in
itself helpful and reduces the load on build failure handling.
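
As an aside, the PATH juggling in the quoted snippet exists only to locate the sync binary; since Python 3.3 the same flush can be requested directly via os.sync(). A sketch, assuming Python >= 3.3 on a POSIX host; `flush_io` is a hypothetical name:

```python
import os

def flush_io(sync: bool = True) -> None:
    # os.sync() wraps the sync(2) syscall, so no PATH manipulation or
    # shelling out to the sync binary is needed. On Linux it blocks
    # until the writeback has been committed, like sync(1) does.
    if sync:
        os.sync()
```

This keeps the intent of the original code (take the IO hit around the tests rather than in bitbake shutdown) without mutating os.environ.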

We do already have a ton of memory in the workers. We're reluctant to reduce the
parallelism numbers as in general things work well; we 'just' see a single failure
in every other build (where the builds report 2 million test results over 4 arches,
each in 32 and 64 bit, with different init systems and so on).

The "timeouts" are things like kernel RCU failures in guests and bitbake event test
issues (where a 5s timeout increased to over 60s doesn't help). Another example is
valgrind's ptests. Where possible we're trying to rewrite the tests not to have the
timeouts.

I'd love some help in getting to the bottom of the issues, but we have already
tried the simpler stuff, sadly :(. The issues are very, very rare but annoying
when they do happen.

Cheers,

Richard


-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#148791): https://lists.openembedded.org/g/openembedded-core/message/148791
Mute This Topic: https://lists.openembedded.org/mt/80933358/21656
Group Owner: [email protected]
Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [[email protected]]
-=-=-=-=-=-=-=-=-=-=-=-
