September 12, 2022 7:51 AM, "Tom Lane" <t...@sss.pgh.pa.us> wrote:

> Julien Rouhaud <rjuju...@gmail.com> writes:
> 
>> On Sun, Sep 11, 2022 at 06:22:21PM +0000, andrey.ara...@nixaid.com wrote:
>>> Everyone using containerized postgresql image cannot use /var/lib/postgresql
>>> as the mountpoint but has to use /var/lib/postgresql/data instead due to 
>>> this
>>> issue [4] due to [5].
>> 
>> initdb had this behavior for almost 10 years, and for good reasons: it 
>> prevents
>> actual problems that were seen in the field.
> 
> The long and the short of this is that one person losing their data
> outweighs thousands of people being very mildly inconvenienced by
> having to create an extra level of directory. I understand that you
> don't think so, but you'd change your mind PDQ if it was *your* data
> that got irretrievably corrupted.
> 
> We are not going to remove this check.
> 
> If anything, the fault I'd find with the existing code is that it's not
> sufficiently thorough about detecting what is a mount point. AFAICS,
> neither the lost+found check nor the extra-dot-files check will trigger
> on modern filesystems such as XFS. I wonder if we could do something
> like comparing the stat(2) st_dev numbers for the proposed data directory
> vs. its parent directory.
> 
> regards, tom lane


Thank you for the explanation Tom & Julien!

This is not an issue for the containerized PostgreSQL where the mountpoint is a 
required dependency without which the container won't even spawn nor start the 
scripts (such as initdb).
This also makes it difficult to comprehend the reason initdb exits when it sees 
PGDATA is a mountpoint it at the first hand.

That and the "Warn about initdb using mount-points" commit [4] made me thinking 
that this was a typo.

But I do understand that having it exiting on detecting PGDATA being a 
mountpoint have saved people from having their DB corrupted is a good enough 
reason for keeping thing as they are now.


== Summarizing the issue ==

The problem is that PostgreSQL will irrecoverably corrupt when the `PGDATA` 
mountpoint gets accidentally unmounted while the DB is up. [1]

The `PGDATA` can't just accidentally get unmounted due to the files being 
locked by a process there, unless there is a time window between the DB 
locking/unlocking the data in it of course.
Or unless someone forcibly unmounts it.

(I'd expect the DB to detect this & immediately terminate with a fatal error)

Or more likely is when `PGDATA` directory gets mounted back again _while_ 
PostgreSQL is running with a newly initialized cluster in the empty PGDATA 
directory (say, due to start-scripts running `initdb` or there was an old 
instance of a previously initialized cluster in the underlying PGDATA directory 
[with mountpoint being unmounted]), and then gets closed which in turn causes 
it writing the wrong data into the correct pg_control file.


== The solution (partial) ==

The solution to this was built into the `initdb` [4] - which makes sure it 
fails when it finds the `PGDATA` is a mountpoint.
This is currently achieved by `initdb` finding the `lost+found` directory in 
PGDATA, which is not a robust solution, since there are filesystems that do not 
place that file onto the mountpoint such as XFS, BTRFS, eCryptfs, ..., or one 
can erase the `lost+found` directory.
Hence, this is only a partial solution to the problem and looks to be more of a 
workaround nature IMHO.


== The recommendation ==

The recommendation is to mount an upper directory instead of PGDATA directly 
(one above the PGDATA, e.g. `$PGDATA/../.`) to make sure the DB or a certain* 
PostgreSQL start script will fail due to a missing `PGDATA` directory instead 
of running the `initdb` to re-init the cluster.

== The reason ==

This is because, as already mentioned before, running the `initdb` can lead to 
a situation where one can irrecoverably corrupt the DB by mounting the original 
PGDATA backing device back over again and stopping/re-starting the DB which 
will irrecoverably overwrite the good pg_control file.
(* - I assume that on some systems the postgresql start-script runs `initdb` 
when detects PGDATA directory is empty before starting the DB; like in the 
postgresql docker container [2] (by checking the presence of 
`$PGDATA/PG_VERSION`). And the default postgresql start script [3] runs the DB 
directly without issuing `initdb`.)


== Next steps? ==

I'm wondering whether that's the same issue for mariadb / mysql ...

And whether there could be a better way to handle this issue and protect the 
PostgreSQL from corrupting the DB in the event the PGDATA gets accidentally 
unmounted / re-mounted leading the the issue described above.
Solving this would allow people to actually mount the PGDATA directly, without 
the need to fix the `initdb`'s mountpoint detection for filesystems without 
`lost+found` directory.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1247477#c1
[2] 
https://github.com/docker-library/postgres/blob/74e51d102aede/docker-entrypoint.sh#L317
[3] 
https://github.com/postgres/postgres/blob/REL_14_5/contrib/start-scripts/linux#L94
[4] 
https://github.com/postgres/postgres/commit/17f15239325a88581bb4f9cf91d38005f1f52d69


Kind regards,
Andrey Arapov


Reply via email to