This is an automated email from the ASF dual-hosted git repository.

yjhjstz pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry.git

commit d8f2bcb603201e663fd3571eb9ea1a278dea4467
Author: Huansong Fu <[email protected]>
AuthorDate: Wed Sep 20 12:12:10 2023 -0700

    Fix a flakiness with test gp_check_files
    
    This should have been done with #16428, but we need to disable autovacuum when
    running the gp_check_files regress test. Otherwise we might see errors like:
    
    ```
    @@ -53,12 +53,8 @@
     -- check orphaned files, note that this forces a checkpoint internally.
     set client_min_messages = ERROR;
     select gp_segment_id, filename from run_orphaned_files_view();
    - gp_segment_id | filename
    ----------------+----------
    -             1 | 987654
    -             1 | 987654.3
    -(2 rows)
    -
    +ERROR:  failed to retrieve orphaned files after 10 minutes of retries.
    +CONTEXT:  PL/pgSQL function run_orphaned_files_view() line 19 at RAISE
     reset client_min_messages;
    ```
    
    In the log we have:
    ```
    2023-09-20 15:33:00.766420 UTC,"gpadmin","regression",p148081,th-589358976,"[local]",,2023-09-20 15:31:39 UTC,0,con19,cmd65,seg-1,,dx38585,,sx1,"LOG","00000","attempt failed 17 with error: There is a client session running on one or more segment. Aborting...",,,,,"PL/pgSQL function run_orphaned_files_view() line 11 at RAISE","select gp_segment_id, filename from run_orphaned_files_view();",0,,"pl_exec.c",3857,
    
    ```
    
    Background jobs may have started backends that the gp_check_orphaned_files
    view refuses to run alongside. Since we decided to keep the view conservative
    (it disallows any backend that could cause false positives in its results),
    the fix belongs in the test.
    
    The test has a safeguard: the function run_orphaned_files_view() retries the
    view in a loop for up to 10 minutes. That did not solve the issue, because
    the function saw only one snapshot of pg_stat_activity during its entire
    execution. Now explicitly call pg_stat_clear_snapshot() to solve that issue.
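    
    As a rough illustration of the snapshot behavior described above (a sketch
    only, not part of this change, assuming a plain psql session and that
    another session connects or disconnects while the transaction is open):
    
    ```sql
    BEGIN;
    -- First access in the transaction takes the activity snapshot.
    SELECT count(*) FROM pg_stat_activity;
    -- Assumed: another session connects or disconnects here. The result
    -- below is still served from the snapshot taken above.
    SELECT count(*) FROM pg_stat_activity;
    -- Discard the cached snapshot so the next read reflects current backends.
    SELECT pg_stat_clear_snapshot();
    SELECT count(*) FROM pg_stat_activity;
    COMMIT;
    ```
    
    A PL/pgSQL function runs within the calling statement's transaction, which
    is why the retry loop needs the explicit call between attempts.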
    
    Co-authored-by: Ashwin Agrawal <[email protected]>
---
 src/test/regress/input/gp_check_files.source  | 2 ++
 src/test/regress/output/gp_check_files.source | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/src/test/regress/input/gp_check_files.source b/src/test/regress/input/gp_check_files.source
index 9ee8509a0a..5e1490d953 100644
--- a/src/test/regress/input/gp_check_files.source
+++ b/src/test/regress/input/gp_check_files.source
@@ -31,6 +31,8 @@ BEGIN
                 RAISE LOG 'attempt failed % with error: %', retry_counter + 1, SQLERRM;
                 -- When an exception occurs, wait for 5 seconds and then retry
                 PERFORM pg_sleep(5);
+                -- Refresh to get the latest pg_stat_activity
+                PERFORM pg_stat_clear_snapshot();
                 retry_counter := retry_counter + 1;
         END;
     END LOOP;
diff --git a/src/test/regress/output/gp_check_files.source b/src/test/regress/output/gp_check_files.source
index 2d6a733db1..70bf5cf9ae 100644
--- a/src/test/regress/output/gp_check_files.source
+++ b/src/test/regress/output/gp_check_files.source
@@ -29,6 +29,8 @@ BEGIN
                 RAISE LOG 'attempt failed % with error: %', retry_counter + 1, SQLERRM;
                 -- When an exception occurs, wait for 5 seconds and then retry
                 PERFORM pg_sleep(5);
+                -- Refresh to get the latest pg_stat_activity
+                PERFORM pg_stat_clear_snapshot();
                 retry_counter := retry_counter + 1;
         END;
     END LOOP;

