Re: snapshot too old issues, first around wraparound and then more.

Robert Haas Wed, 01 Apr 2020 10:28:39 -0700

On Wed, Apr 1, 2020 at 1:03 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I don't think that it's fair to characterize Andres' actions in that
> situation as in any way irresponsible. We had an extremely complicated
> data corruption bug that he went to great lengths to fix, following
> two other incorrect fixes. He was jet lagged from travelling to India
> at the time. He went to huge lengths to make sure that the bug was
> correctly squashed.


I don't mean it as a personal attack on Andres, and I know and am glad
that he worked hard on the problem, but I don't agree that it was the
right decision. Perhaps "irresponsible" is the wrong word, but it's
certainly caused problems for multiple EnterpriseDB customers, and in
my view, those problems weren't necessary. Either a WARNING or an
ERROR would have shown up in the log, but an ERROR terminates VACUUM
for that table and thus basically causes autovacuum to be completely
broken. That is a really big problem. Perhaps you will want to argue,
as Andres did, that the value of having ERROR rather than WARNING in
the log justifies that outcome, but I sure don't agree.

> > Actually removing the code is unnecessary, protects
> > nobody, and has risk.
>
> Every possible approach has risk. We are deciding among several
> unpleasant and risky alternatives here, no?

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

However, I still think that without some more analysis, it's not clear
whether we should go this direction at all. Andres's results suggest
that there are some bugs here, but I think we need more senior hackers
to study the situation before we make a decision about what to do
about them. I certainly haven't had enough time to even fully
understand the problems yet, and nobody else has posted on that topic
at all. I have the highest respect for Andres and his technical
ability, and if he says this stuff has problems, I'm sure it does. Yet
I'm not willing to conclude that because he's tired and frustrated
with this stuff right now, it's unsalvageable. For the benefit of the
whole community, such a claim deserves scrutiny from multiple people.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

Reply via email to