On Wed, Apr 1, 2020 at 1:03 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I don't think that it's fair to characterize Andres' actions in that
> situation as in any way irresponsible. We had an extremely complicated
> data corruption bug that he went to great lengths to fix, following
> two other incorrect fixes. He was jet lagged from travelling to India
> at the time. He went to huge lengths to make sure that the bug was
> correctly squashed.

I don't mean it as a personal attack on Andres, and I know and am glad
that he worked hard on the problem, but I don't agree that it was the
right decision. Perhaps "irresponsible" is the wrong word, but it's
certainly caused problems for multiple EnterpriseDB customers, and in
my view, those problems weren't necessary. Either a WARNING or an ERROR
would have shown up in the log, but an ERROR terminates VACUUM for that
table and thus basically causes autovacuum to be completely broken.
That is a really big problem. Perhaps you will want to argue, as Andres
did, that the value of having ERROR rather than WARNING in the log
justifies that outcome, but I sure don't agree.

> > Actually removing the code is unnecessary, protects
> > nobody, and has risk.
>
> Every possible approach has risk. We are deciding among several
> unpleasant and risky alternatives here, no?

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump. In this case, I think it is pretty clear that
hard-disabling the feature by always setting old_snapshot_threshold to
-1 carries less risk of breaking unrelated things than removing code
that caters to the feature all over the code base. Perhaps it is not
quite as dramatic as my parachute example, but I think it is pretty
clear all the same that one is a lot more likely to introduce new bugs
than the other. A carefully targeted modification of a few lines of
code in 1 file just about has to carry less risk than ~1k lines of
code spread across 40 or so files.

However, I still think that without some more analysis, it's not clear
whether we should go this direction at all. Andres's results suggest
that there are some bugs here, but I think we need more senior hackers
to study the situation before we make a decision about what to do
about them.
I certainly haven't had enough time to even fully understand the
problems yet, and nobody else has posted on that topic at all. I have
the highest respect for Andres and his technical ability, and if he
says this stuff has problems, I'm sure it does. Yet I'm not willing to
conclude that because he's tired and frustrated with this stuff right
now, it's unsalvageable. For the benefit of the whole community, such
a claim deserves scrutiny from multiple people.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company