I reached out to the Linux ext4 devs, here is ty...@mit.edu response:
This isn't actually an ext4 issue, but a long-standing VFS/MM issue.
There are going to be multiple opinions about what the right thing to
do. I'll try to give as unbiased a description as possible, but
certainly some of this is going to be filtered by my own biases no
matter how careful I can be.
First of all, what storage devices will do when they hit an exception
condition is quite non-deterministic. For example, the vast majority
of SSD's are not power fail certified. What this means is that if
they suffer a power drop while they are doing a GC, it is quite
possible for data written six months ago to be lost as a result. The
LBA could potentialy be far, far away from any LBA's that were
recently written, and there could have been multiple CACHE FLUSH
operations in the since the LBA in question was last written six
months ago. No matter; for a consumer-grade SSD, it's possible for
that LBA to be trashed after an unexpected power drop.
Which is why after a while, one can get quite paranoid and assume that
the only way you can guarantee data robustness is to store multiple
copies and/or use erasure encoding, with some of the copies or shards
written to geographically diverse data centers.
Secondly, I think it's fair to say that the vast majority of the
companies who require data robustness, and are either willing to pay
$$$ to an enterprise distro company like Red Hat, or command a large
enough paying customer base that they can afford to dictate terms to
an enterprise distro, or hire a consultant such as Christoph, or have
their own staffed Linux kernel teams, have tended to use O_DIRECT. So
for better or for worse, there has not been as much investment in
buffered I/O and data robustness in the face of exception handling of
Next, the reason why fsync() has the behaviour that it does is one
ofhe the most common cases of I/O storage errors in buffered use
cases, certainly as seen by the community distros, is the user who
pulls out USB stick while it is in use. In that case, if there are
dirtied pages in the page cache, the question is what can you do?
Sooner or later the writes will time out, and if you leave the pages
dirty, then it effectively becomes a permanent memory leak. You can't
unmount the file system --- that requires writing out all of the pages
such that the dirty bit is turned off. And if you don't clear the
dirty bit on an I/O error, then they can never be cleaned. You can't
even re-insert the USB stick; the re-inserted USB stick will get a new
block device. Worse, when the USB stick was pulled, it will have
suffered a power drop, and see above about what could happen after a
power drop for non-power fail certified flash devices --- it goes
double for the cheap sh*t USB sticks found in the checkout aisle of
So this is the explanation for why Linux handles I/O errors by
clearing the dirty bit after reporting the error up to user space.
And why there is not eagerness to solve the problem simply by "don't
clear the dirty bit". For every one Postgres installation that might
have a better recover after an I/O error, there's probably a thousand
clueless Fedora and Ubuntu users who will have a much worse user
experience after a USB stick pull happens.
I can think of things that could be done --- for example, it could be
switchable on a per-block device basis (or maybe a per-mount basis)
whether or not the dirty bit gets cleared after the error is reported
to userspace. And perhaps there could be a new unmount flag that
causes all dirty pages to be wiped out, which could be used to recover
after a permanent loss of the block device. But the question is who
is going to invest the time to make these changes? If there is a
company who is willing to pay to comission this work, it's almost
certainly soluble. Or if a company which has a kernel on staff is
willing to direct an engineer to work on it, it certainly could be
solved. But again, of the companies who have client code where we
care about robustness and proper handling of failed disk drives, and
which have a kernel team on staff, pretty much all of the ones I can
think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try
to make buffered writes and error reporting via fsync(2) work well.
In general these companies want low-level control over buffer cache
eviction algorithms, which drives them towards the design decision of
effectively implementing the page cache in userspace, and using
If you are aware of a company who is willing to pay to have a new
kernel feature implemented to meet your needs, we might be able to
refer you to a company or a consultant who might be able to do that
work. Let me know off-line if that's the case...
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****