Hi folks, I have one realm, one zonegroup, and four zones, all running version 19.2.3. One zone was recently added to the zonegroup while it was still running version 18.2.7. The newly added zone can sync data from the other secondary zones without issues, but its sync from the master zone gets stuck in the init state. The master zone, on the other hand, syncs from it successfully.
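In case it helps narrow this down, I can also pull the per-source view on the new zone; as far as I understand the tooling, the commands would look roughly like this (just a sketch, using the zone names from the status output below):

# Data sync state on the new zone (dc10) against the master zone (dc07) only
radosgw-admin data sync status --rgw-zone=s3-cdn-dc10 --source-zone=s3-cdn-dc07
# Any recorded sync errors on the new zone
radosgw-admin sync error list --rgw-zone=s3-cdn-dc10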
All zones were recently upgraded from versions 18.2.4 and 18.2.7 to 19.2.3, but the problem still persists.

Sync status on the master zone (dc07):

radosgw-admin sync status
          realm 710cf69b-7382-47d2-aca6-03d991b00d1f (s3-cdn)
      zonegroup 7c01d60f-88c6-4192-baf7-d725260bf05d (s3-cdn-group)
           zone 03f6a8ec-008c-4cbf-8efc-d70a6013066f (s3-cdn-dc07)
   current time 2026-02-09T08:57:47Z
zonegroup features enabled:
                   disabled: compress-encrypted, notification_v2, resharding
  metadata sync: no sync (zone is master)
      data sync source: 1a6e33b9-8ece-4b9c-a9a5-961fa97c42c8 (s3-cdn-dc05)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shard
                        behind shards: [93]
                        oldest incremental change not applied: 2026-02-09T08:57:44.779833+0000 [93]
                source: 367dbfe9-a5f8-4101-a271-9749f25ba09c (s3-cdn-dc10)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        1 shard is recovering
                        recovering shards: [125]
                source: 40122a7c-e594-43b7-89bb-e7ada37991c5 (s3-cdn-dc06)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

Sync status on the problematic zone (dc10):

radosgw-admin sync status
          realm 710cf69b-7382-47d2-aca6-03d991b00d1f (s3-cdn)
      zonegroup 7c01d60f-88c6-4192-baf7-d725260bf05d (s3-cdn-group)
           zone 367dbfe9-a5f8-4101-a271-9749f25ba09c (s3-cdn-dc10)
   current time 2026-02-09T09:00:17Z
zonegroup features enabled:
                   disabled: compress-encrypted, notification_v2, resharding
  metadata sync: syncing
                 full sync: 0/64 shards
                 incremental sync: 64/64 shards
                 metadata is caught up with master
      data sync source: 03f6a8ec-008c-4cbf-8efc-d70a6013066f (s3-cdn-dc07)
                        init
                        full sync: 128/128 shards
                        full sync: 0 buckets to sync
                        incremental sync: 0/128 shards
                        data is behind on 128 shards
                        behind shards: [0-127]
                source: 1a6e33b9-8ece-4b9c-a9a5-961fa97c42c8 (s3-cdn-dc05)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shard
                        behind shards: [93]
                        oldest incremental change not applied: 2026-02-09T08:59:44.783203+0000 [93]
                        1 shard is recovering
                        recovering shards: [41]
                source: 40122a7c-e594-43b7-89bb-e7ada37991c5 (s3-cdn-dc06)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

I see the following logs from the sync-enabled RGW service:

Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  4: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0xb4) [0x5602e312a6f4]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  5: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x4d7) [0x5602e3598d97]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  6: /usr/bin/radosgw(+0x837248) [0x5602e32eb248]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  7: (RGWRadosThread::Worker::entry()+0xbd) [0x5602e32ec83d]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  8: /lib64/libc.so.6(+0x8a4da) [0x7fbd2b2394da]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  9: clone()
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]: debug -9999> 2026-02-09T09:05:46.708+0000 7fbbf4c19640 -1 *** Caught signal (Segmentation fault) **
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  in thread 7fbbf4c19640 thread_name:data-sync
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  ceph version 19.2.3 (c92aebb279828e9c3c1f5d24613efca272649e62) squid (stable)
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  1: /lib64/libc.so.6(+0x3ebf0) [0x7fbd2b1edbf0]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  2: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x37) [0x5602e3127c97]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  3: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x481) [0x5602e3129901]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  4: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0xb4) [0x5602e312a6f4]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  5: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x4d7) [0x5602e3598d97]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  6: /usr/bin/radosgw(+0x837248) [0x5602e32eb248]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  7: (RGWRadosThread::Worker::entry()+0xbd) [0x5602e32ec83d]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  8: /lib64/libc.so.6(+0x8a4da) [0x7fbd2b2394da]
Feb 09 09:05:46 ceph-mon10001.dc10.maas.etraveli.io bash[2404887]:  9: clone()

Additionally, when I restart the sync-enabled RGW service on the newly added zone, the daemon crashes; restarting the daemon directly crashes it as well. To get the sync-enabled RGW to restart successfully, I have to run:

ceph config set client.rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg rgw_run_sync_thread false
ceph orch daemon restart rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg
ceph config set client.rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg rgw_run_sync_thread true

I saw that this bug (https://tracker.ceph.com/issues/63378) was resolved and backported to 19.2.3. However, I am still observing the same behavior.
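One thing I have not tried yet is re-initialising data sync for the dc07 source on the new zone. If I understand the tooling correctly, it would look roughly like this (just a sketch, with the same zone and daemon names as above), but I am hesitant given the segfault:

# Reset the per-source data sync state for the master zone, run against the dc10 zone
radosgw-admin data sync init --rgw-zone=s3-cdn-dc10 --source-zone=s3-cdn-dc07
# Then restart the sync-enabled gateway so it restarts sync from that source
ceph orch daemon restart rgw.s3-cdn-colocate.ceph-mon10001.rrbzdg

I can also attach the full crash report from ceph crash info if that would help.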
