I have a 3-zone multi-site setup using Ceph Luminous (12.2.4) on Ubuntu
18.04. I used ceph-deploy to build each cluster and followed
for multi-site setup and have purged the multi-zone config (i.e. purged all
pools and started from scratch) twice now. This is a testing setup so no
data has been lost.

On initial setup, I have no issues: metadata and data both sync as expected.
As soon as I promote a different zone to master, though, metadata sync
breaks on both secondaries. Data sync still works: the object counts in
each zone's $ZONE.rgw.buckets.data pool grow even when I create new buckets
and push data to them. I've narrowed the metadata sync failure down to a 500
returned by the rados gateway in the master zone when the secondaries
request /admin/log?type=metadata&rgwx-zonegroup=$ZONEGROUPID (triggered by
running radosgw-admin metadata sync run).

Associated logs on the rgw instance in the master zone:

2018-05-16 06:25:17.518662 7f8957499700  1 ====== starting new request req=0x7f89574931e0 =====
2018-05-16 06:25:17.520186 7f8957499700  1 failed to decode the mdlog history: buffer::end_of_buffer
2018-05-16 06:25:17.520195 7f8957499700  1 failed to read mdlog history: (5) Input/output error
2018-05-16 06:25:17.520207 7f8957499700  0 WARNING: set_req_state_err err_no=5 resorting to 500
2018-05-16 06:25:17.520263 7f8957499700  1 ====== req done req=0x7f89574931e0 op status=0 http_status=500 ======
2018-05-16 06:25:17.520314 7f8957499700  1 civetweb: 0x55a584cb9000:
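
For anyone wanting to pull similar failures out of a larger rgw log, here is a minimal Python sketch. It assumes one log entry per line (as in the log file itself, not as wrapped by a mail client) and the "date time thread-id level message" layout seen in the excerpt. The function name failed_requests is my own and is not part of any Ceph tooling.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the excerpt: "<date> <time> <thread> <level> <message>"
LINE_RE = re.compile(r"^(\S+ \S+) (\S+)\s+(-?\d+) (.*)$")
STATUS_RE = re.compile(r"http_status=(\d+)")


def failed_requests(lines, min_status=500):
    """Return (thread, http_status, messages) for each request whose
    'req done' line reports a status >= min_status. Messages are the
    lines logged by that thread between request start and request end."""
    pending = defaultdict(list)
    failures = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not fit the assumed layout
        _timestamp, thread, _level, msg = m.groups()
        if "starting new request" in msg:
            pending[thread] = []
        elif "req done" in msg:
            messages = pending.pop(thread, [])
            status = STATUS_RE.search(msg)
            if status and int(status.group(1)) >= min_status:
                failures.append((thread, int(status.group(1)), messages))
        else:
            pending[thread].append(msg)
    return failures
```

Called as failed_requests(open(path)) on a log file, it groups messages by thread ID, which is how the mdlog decode error above can be tied to the request that returned the 500.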

However, 'radosgw-admin mdlog list' works just fine and returns what
appears to be a perfectly valid log.

I have ensured that each of the secondaries has pulled the latest period
and restarted all the gateways. All rgw instances agree on the current
period as well as the master zone.
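
The agreement check described above can be sketched in Python as follows. It assumes you have saved the JSON printed by radosgw-admin period get on each zone and parsed it into a dict, and that the top-level fields "id", "epoch" and "master_zone" are present in that output; treat those field names as an assumption to verify against your own output. The helper name period_mismatches is hypothetical.

```python
def period_mismatches(periods, fields=("id", "epoch", "master_zone")):
    """periods maps a zone name to the dict parsed from that zone's
    `radosgw-admin period get` output. Returns, for every field on which
    the zones disagree, a mapping of zone name to the value it reported.
    An empty result means all zones agree on the checked fields."""
    mismatches = {}
    for field in fields:
        values = {zone: period.get(field) for zone, period in periods.items()}
        if len(set(values.values())) > 1:
            mismatches[field] = values
    return mismatches
```

An empty dict only confirms that the period metadata matches; it says nothing about whether the mdlog history object on the master can be decoded.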

Any ideas on what may be going on?
