On Wed, 28 May 2025, Tomas Vondra wrote:

> Isn't guaranteeing success of a write a general issue with compressed
> filesystem? Why is posix_fallocate() any special in this regard?
> Shouldn't the filesystem be defensive and assume the data is not
> compressible? Or maybe just return EOPNOTSUPP when in doubt.

It's not simple for CoW filesystems, including Btrfs and ZFS. As far as I know, the current design is a compromise; it's not that the developers are happy with it. If you are interested, I can point you to a discussion that links to further discussions:

https://marc.info/?l=linux-btrfs&m=174310663519516&w=2

>> BTW even in the last case, PostgreSQL would not notice the lack of
>> fallocate() support as glibc implements a userspace fallback in
>> posix_fallocate(). That fallback has its own issues that hopefully will
>> not affect postgres (see CAVEATS in man 3 posix_fallocate).


> Well, if btrfs starts returning EOPNOTSUPP, and glibc switches to the
> userspace fallback, we wouldn't notice. But that's up to the btrfs to
> decide if they want to support fallocate. We still need our fallback
> anyway, because of other OSes.

Btrfs decided this a few years back: they "support" fallocate, but because real support is very difficult, they disable compression (among other things) for files with fallocate'd ranges. They can't change that and start returning EOPNOTSUPP out of the blue now, but they are open to adding a mount option that would do so:

https://marc.info/?l=linux-btrfs&m=174310663519516&w=2


>> Should PostgreSQL provide a setting to avoid the use of fallocate()? Or is
>> it the filesystem at fault for not returning EOPNOTSUPP, in which case
>> postgres would use its fallback code?


> I don't have a clear opinion on whether it's a filesystem issue. Maybe
> we should be handling this differently, not sure.

All I'm saying is that this is a regression for PostgreSQL users who keep tablespaces on compressed Btrfs. What could be done on the postgres side is to provide a runtime setting that avoids fallocate() and goes through the old code path instead. Ideally this would be a per-tablespace option, but even a global one is better than nothing.
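
To make the idea concrete, here is a rough sketch of what such a switch could look like. This is not PostgreSQL code: avoid_fallocate, extend_file() and extend_with_zeros() are hypothetical names I made up for illustration, and the "old code path" is approximated as simply writing zero-filled blocks.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical runtime setting; not an existing PostgreSQL option. */
static bool avoid_fallocate = false;

/* Approximation of the old path: extend the file by writing zeros. */
static int
extend_with_zeros(int fd, off_t offset, off_t len)
{
    char buf[8192];

    memset(buf, 0, sizeof(buf));
    while (len > 0)
    {
        size_t  chunk = len < (off_t) sizeof(buf) ? (size_t) len : sizeof(buf);
        ssize_t written = pwrite(fd, buf, chunk, offset);

        if (written < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            return errno;
        }
        if (written == 0)
            return ENOSPC;      /* no progress; assume out of space */
        offset += written;
        len -= written;
    }
    return 0;
}

/* Returns 0 on success or an errno value, like posix_fallocate(). */
static int
extend_file(int fd, off_t offset, off_t len)
{
    if (!avoid_fallocate)
    {
        int rc = posix_fallocate(fd, offset, len);

        /*
         * A libc without a userspace emulation may report EOPNOTSUPP;
         * only in that case fall through to writing zeros.
         */
        if (rc != EOPNOTSUPP)
            return rc;
    }
    return extend_with_zeros(fd, offset, len);
}
```

With avoid_fallocate set, the file never gets a fallocate'd range, so on Btrfs compression should stay enabled for it. A per-tablespace variant would need to look the flag up per file instead of using one global.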




Thanks,
Dimitris


