This is an automated email from the ASF dual-hosted git repository.

yjhjstz pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry.git
commit 2c238f6fc37c96acdb1e8e40bc7d4f426c4ab0fe
Author: Jimmy Yih <[email protected]>
AuthorDate: Thu Sep 22 16:53:58 2022 -0700

    Make gpactivatestandby do retry loop after standby promote

    The pg_ctl promote wait logic checks for the IN_PRODUCTION state in the
    promoted coordinator's control file to determine whether the standby
    coordinator has been successfully promoted. Unfortunately, the
    IN_PRODUCTION state does not guarantee that the promoted coordinator is
    ready to accept database connections yet. Make gpactivatestandby retry
    its forced CHECKPOINT in a loop to guarantee that the CHECKPOINT goes
    through and immediately reflects the new timeline id. With this patch,
    the intermittent gpactivatestandby Behave test flakes seen frequently
    in the Concourse pipeline are resolved.

    The logic in this patch is also somewhat of a revert to previous logic,
    as we used to do the very same retry loop. The retry loop was removed
    because we wrongly assumed the pg_ctl promote wait logic would wait
    until database connections were ready to be accepted. The try/except
    did catch the error, but the error was being ignored (which was
    intentional in the previous retry logic but wrong in the non-retry
    logic).

    GPDB commit references:
    https://github.com/greenplum-db/gpdb/commit/1fe901d4bb9b141fd6931977eb93515982ac6edd
    https://github.com/greenplum-db/gpdb/commit/5868160a73ba9cfcebdb61820635301bfb8cc0c1
---
 gpMgmt/bin/gpactivatestandby | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/gpMgmt/bin/gpactivatestandby b/gpMgmt/bin/gpactivatestandby
index 4aa5e8a97d..ee180022e6 100755
--- a/gpMgmt/bin/gpactivatestandby
+++ b/gpMgmt/bin/gpactivatestandby
@@ -327,15 +327,25 @@ def promote_standby(coordinator_data_dir):
     # written to the pg_control file. This happens in async way in
     # server since pg_ctl promote uses fast_promote. Given gpstart
     # depends on TLI to be reflect after right after promotion mainly
-    # for testing we force CHECKPOINT here.
+    # for testing we force CHECKPOINT here. We loop here since the
+    # pg_ctl promote wait logic simply checks for IN_PRODUCTION in the
+    # promoted coordinator's control file but the coordinator's
+    # postmaster will actually not be ready yet to accept database
+    # connections for a small period of time. Use the same
+    # MIRROR_PROMOTION_TIMEOUT of 10 minutes here as well.
     logger.debug('forcing CHECKPOINT to reflect new TimeLineID...')
-    try:
-        dburl = dbconn.DbURL()
-        conn = dbconn.connect(dburl, utility=True, logConn=False)
-        dbconn.execSQL(conn, 'CHECKPOINT')
-        conn.close()
-    except pygresql.InternalError as e:
-        pass
+    for i in range(600):
+        try:
+            dburl = dbconn.DbURL()
+            conn = dbconn.connect(dburl, utility=True, logConn=False)
+            dbconn.execSQL(conn, 'CHECKPOINT')
+            conn.close()
+            return True
+        except pygresql.InternalError as e:
+            pass
+        time.sleep(1)
+
+    return False

 #-------------------------------------------------------------------------
 # Main
@@ -367,7 +377,9 @@ try:

     # promote standby, only if the standby is running in recovery
     if not requires_restart:
-        promote_standby(options_.coordinator_data_dir)
+        res = promote_standby(options_.coordinator_data_dir)
+        if not res:
+            raise GpActivateStandbyException('Timed out waiting for promoted coordinator to accept database connections.')

     # now we can access the catalog. promote action has already updated
     # catalog, so array.coordinator is the old (promoted) standby at this point.
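The patch's core idea, attempting an action in a loop and swallowing the expected transient error until a deadline, can be sketched generically. The sketch below is illustrative only: the helper name `retry_until` and the use of `ConnectionError` (as a stand-in for `pygresql.InternalError`) are assumptions, not code from gpactivatestandby.

```python
import time

def retry_until(action, timeout_secs=600, interval_secs=1, sleep=time.sleep):
    """Retry `action` until it succeeds or the timeout elapses.

    Mirrors the shape of the patch's loop: attempt the action, swallow
    the expected transient error, sleep, and try again. Returns True on
    the first successful attempt, False if every attempt failed.
    """
    for _ in range(int(timeout_secs // interval_secs)):
        try:
            action()  # e.g. connect in utility mode and run CHECKPOINT
            return True
        except ConnectionError:  # stand-in for pygresql.InternalError
            pass
        sleep(interval_secs)
    return False
```

Returning a boolean rather than raising inside the loop lets the caller decide how to report the timeout, which matches how the patch has `gpactivatestandby` raise `GpActivateStandbyException` only after `promote_standby` returns `False`.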
