On Sun, 2023-11-12 at 21:55 -0500, Tom Lane wrote:
> yuansong <yyuans...@126.com> writes:
> > In PostgreSQL, when a backend process crashes, it can cause other
> > backend processes to also require a restart, primarily to ensure
> > data consistency. I understand that the correct approach is to
> > analyze and identify the cause of the crash and resolve it.
> > However, it is also important to be able to handle a backend
> > process crash without affecting the operation of other processes,
> > thus minimizing the scope of negative impact and improving
> > availability. To achieve this goal, could we mimic the Oracle
> > process by introducing a "pmon" process dedicated to rolling back
> > crashed process transactions and performing resource cleanup? I
> > wonder if anyone has attempted such a strategy or if there have
> > been previous discussions on this topic.
>
> The reason we force a database-wide restart is that there's no way to
> be certain that the crashed process didn't corrupt anything in shared
> memory.  (Even with the forced restart, there's a window where bad
> data could reach disk before we kill off the other processes that
> might write it.  But at least it's a short window.)  "Corruption"
> here doesn't just involve bad data placed into disk buffers; more
> often it's things like unreleased locks, which would block other
> processes indefinitely.
>
> I seriously doubt that anything like what you're describing
> could be made reliable enough to be acceptable.  "Oracle does
> it like this" isn't a counter-argument: they have a much different
> (and non-extensible) architecture, and they also have an army of
> programmers to deal with minutiae like undoing resource acquisition.
> Even with that, you'd have to wonder about the number of bugs
> existing in such necessarily-poorly-tested code paths.
Yes.  I think that PostgreSQL's approach is superior: rather than
investing in code to mitigate the impact of data corruption caused by
a crash, invest in quality code that doesn't crash in the first place.

Euphemistically naming a crash an "ORA-600 error" seems to be part of
their strategy.

Yours,
Laurenz Albe
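As an aside, the "unreleased lock" hazard Tom mentions is easy to
demonstrate outside PostgreSQL.  The following is a minimal sketch in
Python (not PostgreSQL code; the names crash_while_holding and demo
are made up for illustration): a child process acquires a
cross-process lock and then dies without releasing it, and the parent
finds the lock still held.  Without a timeout, the parent would block
forever, which is the analogue of a backend left waiting on an LWLock
that a crashed backend never released.

```python
import multiprocessing
import os


def crash_while_holding(lock):
    # Simulated backend crash: take the lock, then exit abruptly
    # without any cleanup, so the lock is never released.
    lock.acquire()
    os._exit(1)


def demo():
    lock = multiprocessing.Lock()
    p = multiprocessing.Process(target=crash_while_holding, args=(lock,))
    p.start()
    p.join()  # child is dead now, but the lock is still held
    # A plain lock.acquire() here would hang indefinitely; the timeout
    # only exists so this demonstration terminates.
    return lock.acquire(timeout=1)


if __name__ == "__main__":
    print("lock acquired after holder crashed:", demo())
```

The point of the sketch is that an ordinary lock carries no notion of
a dead owner, so some supervising process must either clean up after
the crash (the hypothetical "pmon" approach) or, as PostgreSQL does,
declare shared state suspect and restart everything.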