On 9 April 2018 at 11:50, Anthony Iliopoulos <ail...@altatus.com> wrote:
> On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
>> On 8 April 2018 at 22:47, Anthony Iliopoulos <ail...@altatus.com> wrote:

> To make things a bit simpler, let us focus on EIO for the moment.
> The contract between the block layer and the filesystem layer is
> assumed to be that of, when an EIO is propagated up to the fs,
> then you may assume that all possibilities for recovering have
> been exhausted in lower layers of the stack.

Well, Postgres is using the filesystem. The interface between the
block layer and the filesystem may indeed need to be more complex; I
wouldn't know.

But I don't think "all possibilities" is a very useful concept.
Neither layer here is going to be perfect. They can only promise that
all the possibilities that have actually been implemented have been
exhausted, and even among those only to the degree they can be applied
automatically within the engineering tradeoffs and constraints. There
will always be cases like thin-provisioned devices that an operator
can expand, or degraded RAID arrays that can be repaired after a long
rebuild, and so on. A network device can't know whether a remote
server will eventually come back or will instead have to be
reconfigured by a human or a system automation tool to point at a new
server or a new network configuration.

> Right. This implies though that apart from the kernel having
> to keep around the dirtied-but-unrecoverable pages for an
> unbounded time, that there's further an interface for obtaining
> the exact failed pages so that you can read them back.

No, the interface we have is fsync(), which gives us that information
at the granularity of a single file. The database could in theory
recognize that fsync is not completing on a file, read that file back,
and write it to a new file. More likely we would implement a feature
Oracle has of writing key files to multiple devices. But in practice
that's not what would happen today: a human would notice that the
database has stopped being able to commit, see hardware errors in the
log, stop the database, take a backup, and restore onto a new working
device. With the current interface there is one error, and then
Postgres would pretty much have to say, "sorry, your database is
corrupt and the data is gone, restore from your backups", which is
pretty dismal.
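
To spell out what "one error and you're done" would mean for the
application, here's a minimal sketch (not Postgres code; the
flush_or_die() helper is made up for illustration) of the only safe
reaction under the semantics described in this thread: if fsync()
fails, retrying it proves nothing, because the kernel may already have
dropped the dirty pages and a later call can return success anyway.

/*
 * Minimal sketch: treat an fsync() failure as unrecoverable.  Once
 * fsync() has reported EIO the dirty pages may already have been
 * dropped, so a subsequent fsync() returning 0 would NOT mean the
 * data had reached disk.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* hypothetical helper: fd has already received our buffered write()s */
static void
flush_or_die(int fd, const char *path)
{
    if (fsync(fd) != 0)
    {
        /*
         * Retrying fsync() here would prove nothing: the error has
         * been consumed and the lost pages are not coming back.  Crash
         * and recover from the WAL, or tell the operator to restore
         * from a backup.
         */
        fprintf(stderr, "fsync(%s): %s -- data may be lost\n",
                path, strerror(errno));
        abort();
    }
}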

> There is a clear responsibility of the application to keep
> its buffers around until a successful fsync(). The kernels
> do report the error (albeit with all the complexities of
> dealing with the interface), at which point the application
> may not assume that the write()s were ever even buffered
> in the kernel page cache in the first place.

Postgres cannot just store the entire database in RAM. It writes
things to the filesystem all the time. It calls fsync only when it
needs a write barrier to ensure consistency. That's frequent only for
the transaction log, to ensure it's flushed before the corresponding
data modifications, and then periodically to checkpoint the data
files. The amount of data written between checkpoints can be
arbitrarily large, and Postgres has no idea how much memory is
available for filesystem buffers, how much I/O bandwidth is available,
or what other memory pressure there is. What you're suggesting is that
the application should have to babysit the filesystem buffer cache and
reimplement all of it in user space, because the filesystem is free to
throw away any data any time it chooses?
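
To make that pattern concrete, here's a rough sketch (not the actual
Postgres code; the descriptors and function names are made up) of the
ordering I mean: the log write is followed immediately by an fsync()
acting as the barrier, while the data files accumulate dirty pages in
the kernel and are only fsync'ed at checkpoint time.

/*
 * Rough sketch of the write-barrier pattern described above.  Error
 * handling is omitted to keep the control flow visible.
 */
#include <stddef.h>
#include <unistd.h>

/* Commit: the log record must be durable before the data pages that
 * depend on it are allowed to reach disk; fsync() is the barrier. */
static void
commit_record(int wal_fd, const void *rec, size_t len)
{
    write(wal_fd, rec, len);    /* buffered write, no ordering yet */
    fsync(wal_fd);              /* barrier: the WAL is durable from here */
}

/* Checkpoint: arbitrarily much dirty data may have been written (but
 * not fsync'ed) since the last checkpoint; only this fsync() tells us
 * whether the kernel actually got it all to disk. */
static void
checkpoint_data(int data_fd, const void *pages, size_t len)
{
    write(data_fd, pages, len);
    fsync(data_fd);
}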

The current interface for throwing away the filesystem buffer cache is
unmount. It sounds like the kernel would like a more granular way to
discard just part of a device, which makes a lot of sense in the age
of large network block devices. But I don't think declaring that the
filesystem buffer cache is now something every application needs to
re-implement in user space really helps with that; they're going to
have the same problems to solve.

-- 
greg
