Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Mark Dilger Mon, 09 Apr 2018 14:34:20 -0700

> On Apr 9, 2018, at 2:25 PM, Tomas Vondra <[email protected]> wrote:
> 
> 
> 
> On 04/09/2018 11:08 PM, Andres Freund wrote:
>> Hi,
>> 
>> On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:
>>> I can also imagine a master and standby that are similarly provisioned,
>>> and thus hit an out of disk error at around the same time, resulting in
>>> corruption on both, even if not the same corruption.
>> 
>> I think it's a grave mistake conflating ENOSPC issues (which we should
>> solve by making sure there's always enough space pre-allocated), with
>> EIO type errors.  The problem is different, the solution is different.


I'm happy to take your word for that.

> In any case, that certainly does not count as data corruption spreading
> from the master to standby.

Maybe not from the point of view of somebody looking at the code.  But a
user might see it differently.  If the data being loaded into the master
and getting replicated to the standby "causes" both to get corrupt, then
it seems like corruption spreading.  I put "causes" in quotes because there
is some argument to be made about "correlation does not prove cause" and so
forth, but it still feels like causation from an arm's-length perspective.
If there is a pattern of standby servers tending to fail more often right
around the time that the master fails, you'll have a hard time comforting
users, "hey, it's not technically causation."  If loading data into the
master causes the master to hit ENOSPC, and replicating that data to the
standby causes the standby to hit ENOSPC, and if the bug around ENOSPC has
not been fixed, then this looks like corruption spreading.

I'm certainly planning on taking a hard look at the disk allocation on my
standby servers right soon now.

mark
