Hello everyone.

Recently, while running regression tests, I got the error 'database "contrib_regression" does not exist'. After reproducing the problem, I found that it is an autovacuum worker process that reports this error.

Then I tried to analyze the code. When this autovacuum worker process is forked from the postmaster and enters `InitPostgres` in postinit.c, it performs the following steps:

1. Use the OID of the current database to search for its tuple in the catalog and get the database name. During this scan it takes an AccessShareLock on the catalog and releases it once the scan is done.
2. Call LockSharedObject to take a RowExclusiveLock on the catalog.
3. Use the database name to search the catalog again, to make sure the tuple of the current database still exists.

In the interval between steps 1 and 2, the catalog is not protected by any lock, so another backend process can drop the database successfully, which causes the current process to complain in step 3 that the database does not exist.

This issue can happen not only between an autovacuum worker process and a backend process, but also between two backend processes, given the right interleaving. We can use psql to connect to the database, make the backend process stop in the interval between steps 1 and 2, and let another backend process drop the database; the first backend process will then report this error.

I am not sure whether this error is expected to occur in regression testing. Is it possible to take the lock on the catalog at step 1 and hold it, so that another process has no chance to drop the database, since dropdb needs to lock the catalog with AccessExclusiveLock? And what is the design consideration behind these three steps?

I hope to hear from the kernel hackers, thanks~


--
Best Regards,

Jingtang

——————————————————————

Jingtang Zhang

E-Mail: mrdrivingd...@gmail.com
GitHub: @mrdrivingduck

Sent from Microsoft Surface Book 2.


