Hi Lukas,

On Tue, Mar 23, 2021 at 12:21:02AM +0100, Lukas Tribus wrote:
> I agree that finding the sweet spot can be difficult, but I have to
> say I share Vincent's concerns. I do feel like currently we backport
> *a lot*, especially on those near-EOLed trees. When looking at the
> list of backported patches, I don't feel like the age and remaining
> lifetime is taken into consideration.

It usually is, but let's face it, we're all humans, and when you spend
a whole day doing backports from 2.3 to 1.6, it's not always trivial
to decide whether it's better to drop certain fixes, especially when
you remember having worked on them and know the trouble the bug they
fix can cause.

> I don't want to be the monday morning quarterback,

No, feel free to!

> but in 1.7 we have
> 853926a9ac ("BUG/MEDIUM: ebtree: use a byte-per-byte memcmp() to
> compare memory blocks") and I quote from the commit message:
> 
> > This is not correct and definitely confuses address sanitizers. In real
> > life the problem doesn't have visible consequences.
> > [...]
> > there is no way such a multi-byte access will cross a page boundary
> > and end up reading from an unallocated zone. This is why it was
> > never noticed before.
> 
> This sounds like a backport candidate for "warm" stable branches
> (maybe), but 1.7 and 1.8 feel too "cold" for this, even 8 - 9 months
> ago.

I agree in principle, but I'm pretty sure we've later met a situation
where it resulted in a read past the end of a node. I mean, given that
commits cannot be amended after they're merged, it's quite common to
later figure that a fix for a somewhat innocent bug used to fix a more
important one. This is exactly one of the problems that GregKH explained
in his talk about issues caused by CVE (https://youtu.be/HeeoTE9jLjM).
Normally we address this by adding a personal comment to the commit
during the backport.

> This backport specifically causes a build failure of 1.7.13 on musl
> (#760) - because of a missing backport, but that's just an example.

Yes but this is a good example. We are particularly careful about
such issues and sometimes wait longer to pick a series at once in
order to get a fix and its own fixes. But this does not always work
that well :-/  And we're trying to force ourselves not to backport
too fast to older versions. However we know that backporting a patch
to 5 versions is roughly the same amount of work as doing it for 2
as long as it's still hot in the developer's head, while it would
be 5 if done with a delay, so there's a sweet spot to find there.

We've even thought about using the -next branch in stable repos to
queue delayed fixes and wait for their own fixes, but that implies
doing other non-trivial changes to our workflow.

> 39
> "MINOR" patches made it into 1.6.16, 62 patches in 1.7.13. While it is
> true that a lot of "MINOR" tagged patches are actually important, this
> is still a large number for a tree that is supposed to die so soon.

I agree. But on the other hand many times we've found that we've left a
nasty bug because we didn't backport something. The problem here is in
fact the frequency of such releases. If we managed to emit them more
often with colder fixes, this would be much less of a concern. Right
now, patches are "ejected" once a year or so, and while some of them
are totally cold and save, others are much more recent and may present
a problem. I'm well accustomed to this issue, I was facing the same with
the extended LTS kernels. In theory everyone would love to see a sliding
window of patches, in practice some fixes are quite important and may
rely on other ones so it's not that easy to keep some out of the way.

> Very rarely do users build from source from such old trees anyway (and
> those users would be especially conservative, definitely not
> interested in generic, non-important improvements).

Sure but the goal is not to bring improvements there but to fix real
issues for which they upgrade. On the other hand we also know that such
users will not update their prod unless they face an issue. And when
you face an issue after 3-4 years of operations, it's rarely one tagged
major because it would have been spotted much earlier. Often it's a
corner case of a minor one that was not identified during its fixing
(e.g. a crash every 3 months because the memcmp() above crosses a page
boundary when comparing a string past its 8ths byte and lands into an
unallocated area).

> > But with this in mind, there were two options:
> >   - not releasing the latest fixes
> 
> You are talking about whether to publish a release or not for tree
> X.Y, with the backports that are already in the tree. I don't think
> that's the issue.

That's the concern Vincent initially brought, this is why I'm speaking
about this option.

> I think the discussion should be about what commits land in those old
> trees in the first place.

Yes I agree.

> And I don't believe it is scalable to make
> those decisions during your backporting sessions. Instead I believe we
> should be more conservative when suggesting backports in the commit
> message.

That's OK in an ideal world but the reality is that when there's any
info about backporting, we're already happy. When there's info about
a version, we're very happy, and when that info is accurate, we're
extremely happy. I mean, I've always considered that the person doing
a fix is the best person to analyse the scope and impacts of a bug.
But it's terribly difficult and depends on what is already scheduled
for backports and often it ends up as "should be backported to all
versions", with the final choice in the hands of the poor person
doing the backport.

This is why I'm constantly asking for more details in commits. I'd
prefer if the commit spends 10 lines explaining the nature of the
problem than giving an accurate list of affected versions, because
with an explanation I can try to figure whether that version is
affected or not, and if so, with what level of risk. With just
version information it's easy to overlook something and incorrectly
apply a backport. I know that lots of people find me annoying about
commit messages, but these are instructions to the people doing the
backports, we must always think about that. And guess what ? most of
the time when I'm facing difficulties backporting a fix, I figure
that it's one of mine and that I'm still missing some context info!

> Currently, we say "should/can be backported to X.Y" based on
> whether it is *technically* possible to do so for supported trees, not
> if it makes sense considering the age and lifetime of the suggested
> tree.

Sometimes we do, i.e. "not reasonable to backport past X.Y" or "should
only be backported after an observation period".

> This is why I'm proposing a commit author should make such
> considerations when suggesting backports. Avoiding backports to cold
> trees of no-impact improvements and minor fixes for rare corner cases
> should be a goal.

I think we're all on the same goal, the difficulty is how to achieve it.
I've already mentioned that our current classification of bugs misses
one dimension. We indicate the severity of the bug but not the risk of
taking it. Very often they are similar, I'd say 90% of the time. But
it's not 100 and that means that 10% of those bugs either fix a minor
issue with a high risk or a major issue with a low risk.

If we're more used to mention a risk level in a patch, it will be easier
to figure whether it makes sense to backport a fix to version X or Y at
the time of the backport. And this is more relevant than mentioning a
version (which is a compatibility matrix).

> Unless we insist every single bug needs to be fixed on every single
> supported release branch.

No, that's not my goal. However I want that every bug that affects
users is eventually backported. Note the "eventually" here, it's
important. For me it's not acceptable to force users to upgrade, say,
from 2.0 to 2.2 to fix a bug. We owe fixes for the bugs we write. We
know that sometimes an issue will be so difficult to fix and with so
low importance that it will result in a wont-fix for a branch. I'd
be fine with even having a known-bugs file in certain old branches
to add some entries there when we cannot backport. But I'd like fixes
to be backported in a timely manner if possible, based on their
severity-to-risk ratio. I'm saying "if possible" because taking fixes
out of order also creates new classes of bugs that become specific to
a version and this must be avoided as much as possible. But typically
a fix like the memcmp one could have been queued into the next branch
for a while and merged only with its later fix. One of the difficulties
is that sometimes we have to rush a release for an important bug and
that some pending fixes have not had enough cooling time. If we can
queue fixes faster by level of risk, we should reduce this problem.

Please let the discussion continue on this, I'm really interested in
this subject that a few of us have already started discussing a while
ago with some ideas but nothing outstanding yet.

Also let's be straight on one thing: I'm not aiming for guaranteed
zero regression because that's guaranteed zero fix. I'm aiming for
the maximum level of fixes we can reach with the minimal risk of
regression, and for the fastest fixes possible for these rare
regressions. Applying this rule will mean that regressions are less
common in older versions than newer ones, and will maintain the
promise that it's safe to upgrade within one branch.

Thanks,
Willy

Reply via email to