Hi,
I have a problem with RGW in a multisite configuration on Nautilus 14.2.11. Both
zones run on SSDs with a 10 Gbps network. The master zone consists of 5x Dell
R740XD servers (each with 256 GB RAM, 8x 800 GB SSD for Ceph, 24 CPU cores). The
secondary zone (temporary, for testing) consists of 3x HPE DL360 Gen10 servers
(each with 256 GB RAM, 6x 800 GB SSD, 48 CPU cores).
We have 17 test buckets, each manually sharded to 101 shards and each holding
10M small objects (10-15 kB). The zonegroup configuration is attached below.
The initial replication of 150M objects from the master to the secondary zone
took almost 28 hours and completed successfully.
After deleting the objects from one bucket in the master zone, it is no longer
possible to sync the zones properly. I tried restarting both secondary RGWs,
but without success. The sync status on the secondary zone stays behind the
master, and the object counts in the buckets differ between the two zones.
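For completeness, these are the commands I have been using to inspect the sync state from the secondary zone (the bucket name below is just a placeholder):

```shell
# Overall multisite sync state as seen from the secondary zone
radosgw-admin sync status

# Per-bucket sync state (bucket name is a placeholder)
radosgw-admin bucket sync status --bucket=test-bucket-01

# Errors recorded by the sync machinery, if any
radosgw-admin sync error list
```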
Ceph health is HEALTH_WARN on both zones. On the master zone I have "146 large
omap objects found in pool 'prg2a-1.rgw.buckets.index'" and "16 large omap
objects found in pool 'prg2a-1.rgw.log'". On the secondary zone: "88 large omap
objects found in pool 'prg2a-2.rgw.log'" and "1584 large omap objects found in
pool 'prg2a-2.rgw.buckets.index'".
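As a sanity check on the index-pool warnings, the per-shard key count from the object data alone is already close to the warning threshold (200000 keys is, as I understand it, the Nautilus default for osd_deep_scrub_large_omap_object_key_threshold):

```shell
# Keys per bucket index shard: 10M objects spread over 101 shards.
# Multisite additionally stores bucket index log entries in the same
# shard objects, which is presumably what pushes some shards over the
# large-omap warning threshold.
objects=10000000
shards=101
echo $((objects / shards))   # -> 99009 keys per shard from objects alone
```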
During the sync, average OSD latencies on the secondary zone were read 0.158 ms,
write 1.897 ms, overwrite 1.634 ms. After the sync stalled (after about 12 hours
RGW requests, IOPS and throughput all dropped), average OSD latencies jumped to
read 125 ms, write 30 ms, overwrite 272 ms. After stopping both RGWs on the
secondary zone, average OSD latencies fall back to almost 0 ms, but as soon as I
start the RGWs again, they climb back to read 125 ms, write 30 ms, overwrite
272 ms, with spikes of up to 3 seconds.
We had already seen the same behaviour with a very large number of objects in a
single bucket (150M+ objects), which is why we switched to the strategy of many
smaller buckets, but the results are the same.
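One thing I am only guessing at: whether the data/metadata logs should be spread over more shards so the log-pool omap objects stay smaller. A ceph.conf sketch of what I mean (the values are assumptions on my part, and I understand the log shard counts must match across zones and should not be changed lightly on an existing deployment):

```
# ceph.conf sketch on the RGW hosts -- values are guesses, not verified
[client.rgw]
rgw_data_log_num_shards = 256
rgw_md_log_max_shards = 256
```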
I would appreciate any help or advice on how to tune or diagnose multisite
problems. Does anyone have ideas, or a similar use case? I do not know what is
wrong.
Thank you and best regards,
Miroslav
radosgw-admin zonegroup get
{
"id": "ac0005da-2e9f-4f38-835f-72b289c240d0",
"name": "prg2a",
"api_name": "prg2a",
"is_master": "true",
"endpoints": [
"http://s3.prg1a.sys.cz:80",
"http://s3.prg2a.sys.cz:80"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "d9ebbd1f-3312-4083-b4c2-843e1fb899ad",
"zones": [
{
"id": "d9ebbd1f-3312-4083-b4c2-843e1fb899ad",
"name": "prg2a-1",
"endpoints": [
"http://10.104.200.101:7480",
"http://10.104.200.102:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
},
{
"id": "fdd76c02-c679-4ec7-8e7d-c14d2ac74fb4",
"name": "prg2a-2",
"endpoints": [
"http://10.104.200.221:7480",
"http://10.104.200.222:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "cb831094-e219-44b8-89f3-fe25fc288c00"
ii radosgw 14.2.11-pve1 amd64
REST gateway for RADOS distributed object store
ii ceph 14.2.11-pve1 amd64
distributed storage and file system
ii ceph-base 14.2.11-pve1 amd64
common ceph daemon libraries and management tools
ii ceph-common 14.2.11-pve1 amd64
common utilities to mount and interact with a ceph storage cluster
ii ceph-fuse 14.2.11-pve1 amd64
FUSE-based client for the Ceph distributed file system
ii ceph-mds 14.2.11-pve1 amd64
metadata server for the ceph distributed file system
ii ceph-mgr 14.2.11-pve1 amd64
manager for the ceph distributed storage system
ii ceph-mon 14.2.11-pve1 amd64
monitor server for the ceph storage system
ii ceph-osd 14.2.11-pve1 amd64
OSD server for the ceph storage system
ii libcephfs2 14.2.11-pve1 amd64
Ceph distributed file system client library
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]