This is an automated email from the ASF dual-hosted git repository.

yjhjstz pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry.git
commit 2c238f6fc37c96acdb1e8e40bc7d4f426c4ab0fe
Author: Jimmy Yih <[email protected]>
AuthorDate: Thu Sep 22 16:53:58 2022 -0700

    Make gpactivatestandby do retry loop after standby promote

    The pg_ctl promote wait logic checks for the IN_PRODUCTION state in the
    promoted coordinator's control file to determine whether the standby
    coordinator has been successfully promoted. Unfortunately, the
    IN_PRODUCTION state does not guarantee that the promoted coordinator is
    ready to accept database connections yet. Make gpactivatestandby retry
    its forced CHECKPOINT in a loop to guarantee that the CHECKPOINT goes
    through and immediately reflects the new timeline id. With this patch,
    the intermittent gpactivatestandby Behave test flakes seen frequently
    in the Concourse pipeline are resolved.

    The logic in this patch is also somewhat of a revert to previous logic,
    as we used to do the very same retry loop. The retry loop was removed
    because we wrongly assumed the pg_ctl promote wait logic would wait
    until database connections were ready to be accepted. The try/except
    did catch the error, but the error was being ignored (which was
    intentional in the previous retry logic but wrong in the non-retry
    logic).

    GPDB commit references:
    https://github.com/greenplum-db/gpdb/commit/1fe901d4bb9b141fd6931977eb93515982ac6edd
    https://github.com/greenplum-db/gpdb/commit/5868160a73ba9cfcebdb61820635301bfb8cc0c1
---
 gpMgmt/bin/gpactivatestandby | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/gpMgmt/bin/gpactivatestandby b/gpMgmt/bin/gpactivatestandby
index 4aa5e8a97d..ee180022e6 100755
--- a/gpMgmt/bin/gpactivatestandby
+++ b/gpMgmt/bin/gpactivatestandby
@@ -327,15 +327,25 @@ def promote_standby(coordinator_data_dir):
     # written to the pg_control file. This happens in async way in
     # server since pg_ctl promote uses fast_promote. Given gpstart
     # depends on TLI to be reflect after right after promotion mainly
-    # for testing we force CHECKPOINT here.
+    # for testing we force CHECKPOINT here. We loop here since the
+    # pg_ctl promote wait logic simply checks for IN_PRODUCTION in the
+    # promoted coordinator's control file but the coordinator's
+    # postmaster will actually not be ready yet to accept database
+    # connections for a small period of time. Use the same
+    # MIRROR_PROMOTION_TIMEOUT of 10 minutes here as well.
     logger.debug('forcing CHECKPOINT to reflect new TimeLineID...')
-    try:
-        dburl = dbconn.DbURL()
-        conn = dbconn.connect(dburl, utility=True, logConn=False)
-        dbconn.execSQL(conn, 'CHECKPOINT')
-        conn.close()
-    except pygresql.InternalError as e:
-        pass
+    for i in range(600):
+        try:
+            dburl = dbconn.DbURL()
+            conn = dbconn.connect(dburl, utility=True, logConn=False)
+            dbconn.execSQL(conn, 'CHECKPOINT')
+            conn.close()
+            return True
+        except pygresql.InternalError as e:
+            pass
+        time.sleep(1)
+
+    return False

 #-------------------------------------------------------------------------
 # Main
@@ -367,7 +377,9 @@ try:

     # promote standby, only if the standby is running in recovery
     if not requires_restart:
-        promote_standby(options_.coordinator_data_dir)
+        res = promote_standby(options_.coordinator_data_dir)
+        if not res:
+            raise GpActivateStandbyException('Timed out waiting for promoted coordinator to accept database connections.')

     # now we can access the catalog. promote action has already updated
     # catalog, so array.coordinator is the old (promoted) standby at this point.
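The patch's core idea, attempting an action in a loop and swallowing the expected transient error until a deadline, can be sketched generically. The sketch below is illustrative only: the helper name `retry_until` and the use of `ConnectionError` (as a stand-in for `pygresql.InternalError`) are assumptions, not code from gpactivatestandby.

```python
import time

def retry_until(action, timeout_secs=600, interval_secs=1, sleep=time.sleep):
    """Retry `action` until it succeeds or the timeout elapses.

    Mirrors the shape of the patch's loop: attempt the action, swallow
    the expected transient error, sleep, and try again. Returns True on
    the first successful attempt, False if every attempt failed.
    """
    for _ in range(int(timeout_secs // interval_secs)):
        try:
            action()  # e.g. connect in utility mode and run CHECKPOINT
            return True
        except ConnectionError:  # stand-in for pygresql.InternalError
            pass
        sleep(interval_secs)
    return False
```

Returning a boolean rather than raising inside the loop lets the caller decide how to report the timeout, which matches how the patch has `gpactivatestandby` raise `GpActivateStandbyException` only after `promote_standby` returns `False`.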
