Hi folks, I have one realm, one zonegroup, and four zones, all running version 19.2.3. One zone was recently added to the zonegroup while it was still running version 18.2.7. The newly added zone can sync data from the other secondary zones without issues, but its sync from the master zone gets stuck in the init state. The master zone, on the other hand, syncs from it successfully.
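In case it helps narrow this down, I can also pull the per-source view on the new zone; as far as I understand the tooling, the commands would look roughly like this (just a sketch, using the zone names from the status output below):

# Data sync state on the new zone (dc10) against the master zone (dc07) only
radosgw-admin data sync status --rgw-zone=s3-cdn-dc10 --source-zone=s3-cdn-dc07
# Any recorded sync errors on the new zone
radosgw-admin sync error list --rgw-zone=s3-cdn-dc10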
All zones were recently upgraded from versions 18.2.4 and 18.2.7 to 19.2.3, but the problem still persists.

Sync status on the master zone (dc07):

radosgw-admin sync status
          realm 710cf69b-7382-47d2-aca6-03d991b00d1f (s3-cdn)
      zonegroup 7c01d60f-88c6-4192-baf7-d725260bf05d (s3-cdn-group)
           zone 03f6a8ec-008c-4cbf-8efc-d70a6013066f (s3-cdn-dc07)
   current time 2026-02-09T08:57:47Z
zonegroup features enabled:
                   disabled: compress-encrypted, notification_v2, resharding
  metadata sync: no sync (zone is master)
      data sync source: 1a6e33b9-8ece-4b9c-a9a5-961fa97c42c8 (s3-cdn-dc05)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shard
                        behind shards: [93]
                        oldest incremental change not applied: 2026-02-09T08:57:44.779833+0000 [93]
                source: 367dbfe9-a5f8-4101-a271-9749f25ba09c (s3-cdn-dc10)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        1 shard is recovering
                        recovering shards: [125]
                source: 40122a7c-e594-43b7-89bb-e7ada37991c5 (s3-cdn-dc06)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

Sync status on the problematic zone (dc10):

radosgw-admin sync status
          realm 710cf69b-7382-47d2-aca6-03d991b00d1f (s3-cdn)
      zonegroup 7c01d60f-88c6-4192-baf7-d725260bf05d (s3-cdn-group)
           zone 367dbfe9-a5f8-4101-a271-9749f25ba09c (s3-cdn-dc10)
   current time 2026-02-09T09:00:17Z
zonegroup features enabled:
                   disabled: compress-encrypted, notification_v2, resharding
  metadata sync: syncing
                 full sync: 0/64 shards
                 incremental sync: 64/64 shards
                 metadata is caught up with master
      data sync source: 03f6a8ec-008c-4cbf-8efc-d70a6013066f (s3-cdn-dc07)
                        init
                        full sync: 128/128 shards
                        full sync: 0 buckets to sync
                        incremental sync: 0/128 shards
                        data is behind on 128 shards
                        behind shards: [0-127]
                source: 1a6e33b9-8ece-4b9c-a9a5-961fa97c42c8 (s3-cdn-dc05)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shard
                        behind shards: [93]
                        oldest incremental change not applied: 2026-02-09T08:59:44.783203+0000 [93]
                        1 shard is recovering
                        recovering shards: [41]
                source: 40122a7c-e594-43b7-89bb-e7ada37991c5 (s3-cdn-dc06)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

I see the following logs from the sync-enabled RGW service:

Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  4: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0xb4) [0x5602e312a6f4]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  5: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x4d7) [0x5602e3598d97]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  6: /usr/bin/radosgw(+0x837248) [0x5602e32eb248]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  7: (RGWRadosThread::Worker::entry()+0xbd) [0x5602e32ec83d]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  8: /lib64/libc.so.6(+0x8a4da) [0x7fbd2b2394da]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  9: clone()
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]: debug -9999> 2026-02-09T09:05:46.708+0000 7fbbf4c19640 -1 *** Caught signal (Segmentation fault) **
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  in thread 7fbbf4c19640 thread_name:data-sync
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  1: /lib64/libc.so.6(+0x3ebf0) [0x7fbd2b1edbf0]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  2: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x37) [0x5602e3127c97]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  3: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x481) [0x5602e3129901]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  4: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0xb4) [0x5602e312a6f4]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  5: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x4d7) [0x5602e3598d97]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  6: /usr/bin/radosgw(+0x837248) [0x5602e32eb248]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  7: (RGWRadosThread::Worker::entry()+0xbd) [0x5602e32ec83d]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  8: /lib64/libc.so.6(+0x8a4da) [0x7fbd2b2394da]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  9: clone()

Additionally, when I restart the sync-enabled RGW service on the newly added zone, the daemon crashes; restarting the daemon directly crashes it as well. To get the sync-enabled RGW to restart successfully, I have to run:

ceph config set client.rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg rgw_run_sync_thread false
ceph orch daemon restart rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg
ceph config set client.rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg rgw_run_sync_thread true

I saw that this bug (https://tracker.ceph.com/issues/63378) was resolved and backported to 19.2.3. However, I am still observing the same behavior.
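One thing I have not tried yet is re-initialising data sync for the dc07 source on the new zone. If I understand the tooling correctly, it would look roughly like this (just a sketch, with the same zone and daemon names as above), but I am hesitant given the segfault:

# Reset the per-source data sync state for the master zone, run against the dc10 zone
radosgw-admin data sync init --rgw-zone=s3-cdn-dc10 --source-zone=s3-cdn-dc07
# Then restart the sync-enabled gateway so it restarts sync from that source
ceph orch daemon restart rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg

I can also attach the full crash report from ceph crash info if that would help.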
