Patrick,

On Thu, May 07, 2020 at 07:42:49PM +0200, Tim Düsterhus wrote:
> >> Please note that I review patches on a voluntary basis. I'm not an
> >> "employed first level reviewer".
> > 
> > That's not what I meant. I thought that when regular participants on
> > this list do not spot the errors of first time contributes, it can't be
> > that obvious and directing to CONTRIBUTING might not be enough.
> 
> I agree with you that Willy's reply was overly blunt in this case.
> Especially since you demonstrated an effort to take my review into a
> account and he noticed my Ack (implying it can't be too bad).

In case that happened, I really didn't mean to offend you. It's just
that I'm investing a huge amount of time doing something that nobody
likes, to be honest, which is documenting processes to help everyone
be more autonomous and let the project scale better and rely less on
my shoulders. And certain days I spend 100% of my time dealing with
issues which should never have happened in the first place if the doc
had been read, making me wonder if it's really worth wasting time
documenting. So these days when I end up on yet-another-one in that
band, it can be one too much and I let it rot until next time I have
the opportunity to visit it again.

This is also the reason why I ask people to post their patches here (as
you did), so that comments are shared and help others get better when
it's their turn. No review here is meant to judge the person or attack
anyone. We can criticize code and processes but not anyone's work. The
first and unrivaled bug author here is me, so everyone is safe regarding
this, nobody risks to be laughed at!

I'm always grateful for a contribution because I know how painful it is
to contribute to a project you're not accustomed to, and I really value
the effort way more than the result (which is why I continue to silently
fix some contributions instead of asking people to resubmit with a minor
change).

The root cause of the problem that happened is not trivial, but it's not
complicated to understand either once you get the whole picture. Today
the development time is spent this way:

    1: analyzing issues and fixing bugs (*more* than 50% of the time)
    2: reviewing code to try to reduce future impact on #1 above
    3: trying to untangle the mess of old code when trying to add something
    4: actually adding the 5 lines that were really necessary for a new
       given feature

#1 is mostly affected by "how is it possible to do end up in this situation?",
immediately followed by "but WHY does it HAVE TO work like this?". This "WHY"
is exactly what must appear in a commit message, it's the *justification*
that a change is necessary. Consider that you're *selling* your patch to the
community and that the people maintaining the project are going to devote a
little bit more of their own time to keep your code alive forever. You want
to convince them that your change is really worth the invested time. And
sadly this "why" is often missing in a lot of patches. It does have
devastating effects on bug fixing. It happens sometimes that an issue is
incorrectly fixed because the original intent of a change was not well
understood, and that it's only after 5-6 iterative fixes, and in-person
discussions with several engineers that it becomes obvious that everyone
went down the wrong road and that this has to be completely unrolled and
done differently. A good example of this is all the mess that finally
ended with this commit:

  04400bc787 ("BUG/MAJOR: stream-int: Don't receive data from mux until 
SI_ST_EST is reached")

This is extremely time-consuming and the cause of many of the bugs that
take a long time to fix.

Further, when backporting fixes, there are always differences between
branches. We try to limit the amount of annoying changes that further add
to the maintenance cost, because seeing patches fail to apply or build in
each and every stable branch, or requiring complex adaptations is a real
pain and is the *only* cause for stable branches sometimes lagging behind
for a while and missing some fixes. As a hint, if your patch has zero
impact on any existing setup and can trivially be backported, just asking
kindly will often be enough to get it backported. Some here already know
this and use this. This is important as it saves some users from the risk
of upgrading to a much more recent branch for something trivial. It often
happens for sample fetch functions or converters.

Now you might think "but my change is trivial and will not cause any
issue". Before responding to this, let me show you what I spent my whole
afternoon on because I think it's very representative of this process.

In 2018 a user reported a bug on Discourse. Lukas helped the person,
diagnosed the issue and proposed this fix:

  commit fd9b68c48ecdba5e7971899f4eec315c8e3a3cfe
  Author: Lukas Tribus <[email protected]>
  Date:   Sat Oct 27 20:06:59 2018 +0200

    BUG/MINOR: only mark connections private if NTLM is detected
    
    Instead of marking all connections that see a 401/407 response private
    (for connection reuse), this patch detects a RFC4559/NTLM authentication
    scheme and restricts the private setting to those connections.
    
    This is so we can reuse connections with 401/407 responses with
    deterministic load balancing algorithms later (which requires another fix).
    
    This fixes the problem reported here by Elliot Barlas :
    
      
https://discourse.haproxy.org/t/unable-to-configure-load-balancing-per-request-over-persistent-connection/3144
    
    Should be backported to 1.8.

All the description is there and the fix is absolutely trivial and does
exactly what it claims, as such it was merged into 1.9-dev5 and
backported to 1.8 as requeted.
  
In parallel Christopher was working on HTX which was about to be
merged (1.9-dev7). 

First issue, it's only in 1.9-dev9 that during careful code review it
was noticed that the equivalent for this fix was not implemented in HTX
and was thus done later (no harm to stable branches though):

  commit 6160832bf7b4a4df47406e63daf441529fa62631
  Author: Christopher Faulet <[email protected]>
  Date:   Fri Nov 23 16:23:45 2018 +0100

    BUG/MINOR: proto_htx: only mark connections private if NTLM is detected
    
    The commit fd9b68c48 ("BUG/MINOR: only mark connections private if NTLM is
    detected") was forgotten when HTX analyzers were added.

In 2.0, http-reuse finally became the default, and a user recently reported
a serious issue of users being mixed on a same server when using NTLM (github
issue #581). The user provided a config, with nothing apparently wrong. Ilya
set up a lab with NTLM to try reproduce the issue. 5 or 6 people worked
on this issue, emitting hypothesis and trying to validate them. Finally
thanks to Ilya's lab, Olivier managed to understand the problem (poor
documentation of NTLM regarding optional arguments) and fix the bug. 6
days had elapsed since the report. The fatigue caused on participants
started to be visible in that the commit was pushed without a reference
to the issue, which further adds cost dealing with it in backports.

Today while planning to work on backporting this fix I was wondering if
other cases could happen based on Ilya's traces. I noticed that Olivier's
fix introduced a risk of bug, and while fixing it I discovered that the
very initial patch was wrong as well on one of the authentication schemes.
And to complete the failures, the issue number I referenced in both commit
messages (the one for Olivier's fix and the one for Lukas' fix) was wrong!

So for this initial fix we now have something like 7 successive patches,
with different implementations in older versions, different files affected,
and they will have to be reimplemented by hand differently in 2.1, 2.0 and
1.8. That's probably worth yet another day of work. That's 24 days after
the issue was reported. All this just for a suboptimal load-balancing on
a server using Basic auth initially.

If Lukas' initial patch hadn't been that detailed for so small a patch,
we'd simply have reverted it 24 days ago thinking "bah what's the point,
it's probably not needed anymore"! Fortunately he provided all the arguments
for his change to stay alive in the middle of all this mess, whose cause
finally was an extremely poor documentation on Microsoft's hacking around
the HTTP protocol.

All this to say that EVERY CHANGE CREATES TECHNICAL DEBT AND MUST JUSTIFY IT.

Now getting back to the time-consuming points above, #2 reviewing code.
It happens that it takes a huge amount of time, requires extreme context-
switching, and is a thankless effort because you see the same mistakes
all the day but from different people, wondering how to broadcast some
rules that everyone would read and follow. We're not as demanding as the
Linux kernel regarding code formatting, but despite this it takes a huge
amount of time. And it's simply not possible to devote as much time to
each and every patch. Sometimes I figure from a patchset or because I'm
used to the person that it's worth spending time writing a detailed review
or response (like I'm doing now because you clearly showed efforts and
seem worth investing on), and sometimes I'm too tired or just fed up with
repeating my self and I give up. And I'm glad that other participants are
starting to review code, that definitely helps scaling, we need even more!

The third point was about the time it takes to unroll everything when
you want to add 5 lines of code and discover that it will not be possible.
Sometimes this is caused by a #include dependency hell, sometimes by changes
that were begun for something and not pushed far enough, sometimes due to
old code that you don't know if you can get rid of or not. And this
process suffers from the exact same issues as the first one: being able
to take an educated decision based on all the "WHY" in the changes you're
going to partially undo. The clearer the justification for a change, the
faster the developer will be at taking the right decision without having
to go through hundreds of commits or reading the mailing list archives
and github issues trying to recover an old discussion on the subject,
and causing less bugs due to his change missing an undocumented corner
case.

So this is why I'm a bit demanding on the details in commit messages.
Now let's be fair, we're all human and we all fail from time to time. I'm
never expecting anyone to be fail-safe, I'm just trying to make sure the
amount of invested efforts before the patch is merged is high enough so
that the cost to maintain it gets much lower.

You said you had a look at CONTRIBUTING before submitting, so I feel how
frustrating it can be to take my remark back after this. And being the
author of that file I'll say that it's my fault and I failed. I'd like
to ask you what I should do to try to make some parts more understandable
or more prominent. Should I add a shortcut for those experienced with
contributions to other projects and just seeking the basic rules maybe ?
Anything else ?

Regarding the problem with your initial patch, based on what is described
above, you might now understand it. Having a patch that just creates a
new file and moves a function there and updates the makefile to use it
with zero justification is only going to cause some extra work for all
those involved in the maintenance. But sometimes it may be worth it because
it brings savings elsewhere. That's exactly what ought to be in this commit
message: why you decided to create a new file and move that function there.

And by doing so you further reduce the round trips of the review. In the
current situation someone has to respond to you "why did you decide to move
that function there ?", and then you'll respond with the justification that
would have been in the commit message, then the person (or another one) would
argue for or against this. And just to let you know, this is absolutely not
made up, because Christopher seeing this just asked me privately "are you
really sure we want to move it there?". By having that justification in the
commit message the first step above is already done.

Actually many of the patches that end up without response are those which
require an extra round-trip (not all, because I'm definitely missing some).
And often being caught by other stuff, when I get time again to revisit
them they're far away behind a forest of mails, questions, issues and other
patches so they're definitely missed.

I've thought several times about writing a small script a-la-checkpatch
just to verify such things, but the fact is that people who currently don't
find the info about this will not find the info about the need to run that
script, so it sounds pointless. Hence I continue to think that improving
the doc is the best solution.

Sorry for this long response, but I hope it clarifies a few points. Just
don't take anything personal here, I sincerely hope that spending one hour
explaining this can save your and others' time in the future. I already
planned to write an article about how to best contribute, I think I'll be
able to pick some points from this mail :-)

Since I've seen that Christopher was already having a look at your patch,
I'm deferring to him on this matter. But if you have any question about
any of the points above, feel free to ask!

Thanks,
Willy

Reply via email to