[I] It takes a very long time to create checkpoint for a replica with 0 or 1 record while adding new duplication [incubator-pegasus]

via GitHub Wed, 10 Jul 2024 02:01:49 -0700


empiredan opened a new issue, #2069:
URL: https://github.com/apache/incubator-pegasus/issues/2069


   Firstly, we create a new table in the source cluster:
   ```
   >>> create test1 -p 8 -r 1
   create app test1 succeed, waiting for app ready
   test1 not ready yet, still waiting... (0/8)
   test1 is ready now: (8/8)
   test1 is ready now!
   create app "test1" succeed
   
   >>> ls -d
   [general_info]
   app_id  status     app_name                  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count
   1       AVAILABLE  __detect                  pegasus   8                1              true         2024-07-10_11:55:48  -          -            0
   2       AVAILABLE  __stat                    pegasus   8                1              true         2024-07-10_11:55:48  -          -            0
   3       AVAILABLE  temp                      pegasus   8                1              true         2024-07-10_11:55:48  -          -            0
   4       AVAILABLE  xyz                       pegasus   8                1              true         2024-07-10_11:56:36  -          -            0
   5       AVAILABLE  test1                     pegasus   8                1              true         2024-07-10_15:32:13  -          -            0
   ```
   
   Then, put 2 different records into the table:
   ```
   >>> use test1
   OK
   >>> set a b c
   OK
   
   app_id          : 5
   partition_index : 4
   decree          : 1
   server          : 10.1.128.223:8171
   >>> set 1 2 3
   OK
   
   app_id          : 5
   partition_index : 2
   decree          : 1
   server          : 10.1.128.223:8171
   >>> full_scan
   partition: all
   hash_key_filter_type: no_filter
   sort_key_filter_type: no_filter
   value_filter_type: no_filter
   max_count: -1
   timout_ms: 5000
   detailed: false
   no_value: false
   
   "a" : "b" => "c"
   "1" : "2" => "3"
   
   2 key-value pairs got.
   ```
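
The two `set` responses above report `partition_index` 4 and 2, so with 8 partitions every replica of `test1` holds either 0 or 1 record, which is the case named in the title. A minimal sketch of that tally, assuming the `writes` list below is transcribed by hand from the shell output (nothing here queries Pegasus):

```python
PARTITION_COUNT = 8  # from `create test1 -p 8 -r 1`

# (hash_key, sort_key, partition_index) as printed by the two `set` commands.
writes = [("a", "b", 4), ("1", "2", 2)]

# Start every partition at zero, then count the writes that landed on it.
records_per_partition = {index: 0 for index in range(PARTITION_COUNT)}
for _hash_key, _sort_key, partition_index in writes:
    records_per_partition[partition_index] += 1

empty = [i for i, n in records_per_partition.items() if n == 0]
single = [i for i, n in records_per_partition.items() if n == 1]

print("partitions with 0 records:", empty)
print("partitions with 1 record: ", single)
```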
   
   Add a new duplication for the table `test1`:
   ```
   >>> add_dup test1 target_cluster -s -a test_dup_1 -r 3
   trying to add duplication [app_name: test1, remote_cluster_name: target_cluster, is_duplicating_checkpoint: true, remote_app_name: test_dup_1, remote_replica_count: 3]
   adding duplication succeed [app_name: test1, remote_cluster_name: target_cluster, appid: 5, dupid: 1720596785, is_duplicating_checkpoint: true, remote_app_name: test_dup_1, remote_replica_count: 3]
   ```
   
   Check the status of the table in the remote cluster. For several tens of minutes, every partition stays in the dead state (all 3 replicas are unavailable):
   ```
   >>> ls -d
   [general_info]
   app_id  status     app_name                  app_type  partition_count  replica_count  is_stateful  create_time          drop_time  drop_expire  envs_count
   1       AVAILABLE  __detect                  pegasus   8                3              true         2024-07-10_12:00:37  -          -            0
   2       AVAILABLE  __stat                    pegasus   8                3              true         2024-07-10_12:00:37  -          -            0
   3       AVAILABLE  temp                      pegasus   8                3              true         2024-07-10_12:00:37  -          -            0
   4       AVAILABLE  xyz                       pegasus   8                3              true         2024-07-10_12:01:51  -          -            0
   5       AVAILABLE  test_dup_1                pegasus   8                3              true         2024-07-10_15:33:38  -          -            3
   
   [healthy_info]
   app_id  app_name                  partition_count  fully_healthy  unhealthy  write_unhealthy  read_unhealthy
   1       __detect                  8                8              0          0                0
   2       __stat                    8                8              0          0                0
   3       temp                      8                8              0          0                0
   4       xyz                       8                8              0          0                0
   5       test_dup_1                8                0              8          8                8
   
   [summary]
   total_app_count            : 5
   fully_healthy_app_count    : 4
   unhealthy_app_count        : 1
   write_unhealthy_app_count  : 1
   read_unhealthy_app_count   : 1
   ```
   
   Check the checkpoints of table `test1` in the source cluster: no checkpoint has been created for any replica.
   ```
   $ ll 5.*.pegasus/data
   5.0.pegasus/data:
   total 0
   drwxr-xr-x 2 data data 163 Jul 10 15:32 rdb
   
   5.1.pegasus/data:
   total 0
   drwxr-xr-x 2 data data 163 Jul 10 15:32 rdb
   
   5.2.pegasus/data:
   total 0
   drwxr-xr-x 2 data data 163 Jul 10 15:32 rdb
   
   5.3.pegasus/data:
   total 0
   drwxr-xr-x 2 data data 163 Jul 10 15:32 rdb
   
   5.4.pegasus/data:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 163 Jul 10 15:32 rdb
   
   5.5.pegasus/data:
   total 0
   drwxr-xr-x 2 data data 163 Jul 10 15:32 rdb
   
   5.6.pegasus/data:
   total 0
   drwxr-xr-x 2 data data 163 Jul 10 15:32 rdb
   
   5.7.pegasus/data:
   total 0
   drwxr-xr-x 2 data data 163 Jul 10 15:32 rdb
   ```
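
The manual `ll` check above can also be scripted. A minimal sketch, assuming the replica directories follow the `<app_id>.<partition_index>.pegasus/data` layout seen in the listing; `list_checkpoints` is a hypothetical helper written for this illustration, not part of Pegasus:

```python
import re
from pathlib import Path


def list_checkpoints(replica_root, app_id):
    """Return {partition_index: [checkpoint directory names]} for one app.

    Assumes the `<app_id>.<partition_index>.pegasus/data/checkpoint.*`
    layout shown in the `ll` output; a partition with no checkpoint yet
    maps to an empty list.
    """
    pattern = re.compile(rf"^{app_id}\.(\d+)\.pegasus$")
    result = {}
    for replica_dir in Path(replica_root).iterdir():
        match = pattern.match(replica_dir.name)
        if not match:
            continue  # directory belongs to another app, or is unrelated
        data_dir = replica_dir / "data"
        names = []
        if data_dir.is_dir():
            names = sorted(
                p.name for p in data_dir.glob("checkpoint.*") if p.is_dir()
            )
        result[int(match.group(1))] = names
    return result


if __name__ == "__main__":
    # Run from the directory that contains the 5.*.pegasus directories.
    for partition, names in sorted(list_checkpoints(".", 5).items()):
        print(f"5.{partition}: {names if names else 'no checkpoint yet'}")
```

Polling this once a minute would show when each replica's first checkpoint appears, instead of comparing two manual listings.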
   
   After nearly one hour (from `15:32` to `16:27`), a checkpoint had been created for every replica:
   ```
   [dev (v.v) sa_cluster@hybrid01 reps]$ ll 5.*.pegasus/data/
   5.0.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:12 checkpoint.24
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:12 rdb
   
   5.1.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:07 checkpoint.21
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:07 rdb
   
   5.2.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:07 checkpoint.21
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:07 rdb
   
   5.3.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:12 checkpoint.24
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:12 rdb
   
   5.4.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:22 checkpoint.30
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:22 rdb
   
   5.5.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:12 checkpoint.24
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:12 rdb
   
   5.6.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:07 checkpoint.21
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:07 rdb
   
   5.7.pegasus/data/:
   total 0
   drwxr-xr-x 2 sa_cluster sa_group 147 Jul 10 16:27 checkpoint.33
   drwxr-xr-x 2 sa_cluster sa_group 199 Jul 10 16:27 rdb
   ```
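
To make the delay per replica explicit, the directory times in the listing above can be compared with the `15:32` starting point used in this report. A minimal sketch, assuming the `CHECKPOINTS` table below is copied by hand from the `ll` output and that `15:32` (the `create_time` of `test1`) is an acceptable baseline for every partition:

```python
from datetime import datetime

# partition_index -> (checkpoint directory, modification time), from `ll`.
CHECKPOINTS = {
    0: ("checkpoint.24", "16:12"),
    1: ("checkpoint.21", "16:07"),
    2: ("checkpoint.21", "16:07"),
    3: ("checkpoint.24", "16:12"),
    4: ("checkpoint.30", "16:22"),
    5: ("checkpoint.24", "16:12"),
    6: ("checkpoint.21", "16:07"),
    7: ("checkpoint.33", "16:27"),
}

# Assumption: the baseline is the table creation time quoted in the report.
BASELINE = "15:32"


def minutes_between(start, end):
    """Whole minutes from `start` to `end`, both HH:MM on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


delays = {
    partition: minutes_between(BASELINE, mtime)
    for partition, (_name, mtime) in CHECKPOINTS.items()
}

for partition, minutes in sorted(delays.items()):
    name = CHECKPOINTS[partition][0]
    print(f"5.{partition}: {name} appeared after {minutes} min")
print(f"fastest: {min(delays.values())} min, slowest: {max(delays.values())} min")
```

With these inputs the delays range from 35 to 55 minutes, even though no replica holds more than one record.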
   
   

