Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Craig Ringer Mon, 09 Apr 2018 18:56:01 -0700

On 10 April 2018 at 04:25, Mark Dilger <[email protected]> wrote:


> I was reading this thread up until now as meaning that the standby could
> receive corrupt WAL data and become corrupted.

Yes, it can, but not directly through the first error.

What can happen is that we think a block got written when it didn't.

If our in memory state diverges from our on disk state, we can make
subsequent WAL writes based on that wrong information. But that's
actually OK, since the standby will have replayed the original WAL
correctly.

I think the only time we'd run into trouble is if we evict the good
(but not written out) data from s_b and the fs buffer cache, then
later read in the old version of a block we failed to overwrite. Data
checksums (if enabled) might catch it unless the write left the whole
block stale. In that case we might generate a full page write with the
stale block and propagate that over WAL to the standby.

So I'd say standbys are relatively safe - very safe if the issue is
caught promptly, and less so over time. But AFAICS WAL-based
replication (physical or logical) is not a perfect defense for this.

However, remember, if your storage system is free of any sort of
overprovisioning, is on a non-network file system, and doesn't use
multipath (or sets it up right) this issue *is exceptionally unlikely
to affect you*.

-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Reply via email to