[OE-core] Dilemma on changes - merge or not to merge (e.g. 6.4)

Richard Purdie Mon, 14 Aug 2023 02:54:50 -0700

I'm becoming a little weary/wary of some of the changes that are coming
in. The challenge is that once they merge, issues become the problem of
a very small number of people.

My current dilemma is the 6.4 kernel. People would like it, and ideally
we'd use it for the next release, but there are issues.

I've worked through a few, at least pinning down where the issues were
then resolving them with the help of others (thanks Bruce, Jon, Ross).

Remaining are:
* an error upon boot on preempt-rt on qemux86-64
(e.g.
https://autobuilder.yoctoproject.org/typhoon/#/builders/72/builds/7616/steps/36/logs/stdio)
We'll probably just have to ignore it in parselogs as it has been
around for a while and nobody seems interested in fixing it upstream.
* some random hangs:

https://autobuilder.yoctoproject.org/typhoon/#/builders/148/builds/349/steps/12/logs/stdio

https://autobuilder.yoctoproject.org/typhoon/#/builders/148/builds/354/steps/12/logs/stdio

The latter are rare and intermittent, mainly taking out CI test builds.
Most people aren't affected by them, find them hard to reproduce, let
alone fix, and will ignore them. That will leave me/Bruce/PaulG holding
the pieces.

I know Bruce spends a ton of time debugging weird things just to get
the kernel to the point we can even consider merging and nobody ever
really sees or appreciates that work :(.

Systemd was a similar challenge recently, multiple patches causing
multiple issues with a significant impact on CI. In that case the
issues weren't intermittent so resolution wasn't so bad.

Reproducibility in Rust was given a pass so the rest of its changes
could merge. That just meant there was less pressure, and the
reproducibility issue is still there with people saying it's too hard.
That issue is now spreading down the chain to other recipes.

The toolchain test reports have thousands of failures nobody is really
looking at. The same goes for the now-consistent ltp controllers
failures (previously the reports weren't even consistent!).

I'm worried the access control patches changing the tar format are
going to destabilise things and, once merged, people will move on to
other things, leaving any remaining intermittent issues to me. Already we're
seeing things like sstate being blamed as it is easiest to do that. I
end up having to "prove" it isn't that.

There are intermittent ptests on the autobuilder too. I took mdadm
ptest patches on the basis there was help to fix them. We are still seeing
a lot of failures in CI from there. The glib-networking intermittent
failures continue, I know Trevor has tried to dig into those but he is
alone in doing it in code which isn't easy to navigate (and I don't
know how to help there).

As an idea of impact, every time one of these things fails in CI,
someone has to triage that failure. The bug triage team has to triage the
bugs too.

I don't know how we fix this but we really could do with more people
able to dive in and help with these intermittent issues. I'm really,
really apprehensive about merging some patches as I can just tell
they're going to cause pain :(.

Cheers,

Richard

