Dear hackers,

While investigating a cfbot failure [1], I found strange behavior in the pg_ctl command. What do you think? Is this a bug that should be fixed, or is it within the specification?

# Problem

The "pg_ctl start" command returns 0 (success) even if the cluster has already been started. This occurs on Windows, when the command is executed just after the postmaster starts.

# Analysis

The primary cause is in wait_for_postmaster_start(). This function reads the postmaster.pid file and checks whether the start command has succeeded.

Check (1) requires that the postmaster was started after our pg_ctl command, but 2 seconds of slop is allowed. On non-Windows platforms, check (2) is also executed to ensure that the forked process wrote the file, so this time window is not much of a problem. On Windows, however, (2) is skipped, *so the pg_ctl command may succeed if an already-running postmaster was started no more than 2 seconds before it.*

```
        if ((optlines = readfile(pid_file, &numlines)) != NULL &&
            numlines >= LOCK_FILE_LINE_PM_STATUS)
        {
            /* File is complete enough for us, parse it */
            pid_t       pmpid;
            time_t      pmstart;

            /*
             * Make sanity checks.  If it's for the wrong PID, or the recorded
             * start time is before pg_ctl started, then either we are looking
             * at the wrong data directory, or this is a pre-existing pidfile
             * that hasn't (yet?) been overwritten by our child postmaster.
             * Allow 2 seconds slop for possible cross-process clock skew.
             */
            pmpid = atol(optlines[LOCK_FILE_LINE_PID - 1]);
            pmstart = atol(optlines[LOCK_FILE_LINE_START_TIME - 1]);
            if (pmstart >= start_time - 2 && // (1)
#ifndef WIN32
                pmpid == pm_pid // (2)
#else
            /* Windows can only reject standalone-backend PIDs */
                pmpid > 0
#endif
```

# Appendix - how I found this

I found it while investigating the failure. In the test, "pg_upgrade --check" is executed just after the old cluster has been started. I checked the output file [2] and found that the banner says "Performing Consistency Checks", which means that the parameter live_check was set to false (see output_check_banner()). This parameter is set to true when the postmaster is already running at that time and pg_ctl start fails. Since it stayed false here, pg_ctl start must have reported success even though the old cluster was already running.
That is how I found it.

[1]: https://cirrus-ci.com/task/4634769732927488
[2]: https://api.cirrus-ci.com/v1/artifact/task/4634769732927488/testrun/build/testrun/pg_upgrade/003_logical_replication_slots/data/t_003_logical_replication_slots_new_publisher_data/pgdata/pg_upgrade_output.d/20230905T080645.548/log/pg_upgrade_internal.log

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
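
P.S. To make the time window from the Analysis section concrete, here is a minimal sketch of the two acceptance conditions as a standalone function. This is not the pg_ctl source: pidfile_accepted(), pidfile_accepted_examples(), the check_pid flag, and the sample timestamps and PIDs are all hypothetical names and values chosen for illustration. The flag only stands in for the #ifndef WIN32 branch of the excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * Hypothetical model of the pidfile sanity check (not the pg_ctl code).
 *
 * start_time: when pg_ctl started
 * pm_pid:     PID that pg_ctl expects for its child
 * pmstart:    start time recorded in postmaster.pid
 * pmpid:      PID recorded in postmaster.pid
 * check_pid:  true models the non-Windows branch (pmpid == pm_pid),
 *             false models the Windows branch (pmpid > 0)
 */
static bool
pidfile_accepted(time_t start_time, long pm_pid,
                 time_t pmstart, long pmpid, bool check_pid)
{
    /* (1) allow the recorded start time to be up to 2 seconds early */
    if (pmstart < start_time - 2)
        return false;

    /* (2) compare PIDs only where the expected PID is usable */
    if (check_pid)
        return pmpid == pm_pid;

    /* Windows branch: any positive PID passes */
    return pmpid > 0;
}

/* Illustrative cases with made-up timestamps and PIDs. */
void
pidfile_accepted_examples(void)
{
    time_t      now = 1000;     /* pretend pg_ctl started at t = 1000 */

    /* Pre-existing postmaster (PID 100) started 1 second earlier. */
    assert(!pidfile_accepted(now, 200, now - 1, 100, true));   /* rejected by (2) */
    assert(pidfile_accepted(now, 200, now - 1, 100, false));   /* accepted: the issue */

    /* Started 3 seconds earlier: rejected by (1) in both branches. */
    assert(!pidfile_accepted(now, 200, now - 3, 100, false));
}
```

Under these assumptions, a pidfile left by a postmaster that started 1 second before pg_ctl is rejected only when the PID comparison is available. Without it, check (1) alone has to tell old and new pidfiles apart, and the 2-second slop is exactly the window in which it cannot.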