Well, I was hoping for a reply from Inktank, but I'll describe the process
I plan to test:

Best Case:
Primary zone is down
Disable radosgw-agent in secondary zone
Update the region in the secondary to enable data and metadata logging
Update DNS/Load balancer to send primary traffic to secondary
Secondary is now the primary

When the old primary comes back online, don't write to it
Update the old primary's zone configs to configure it as a secondary
Setup radosgw-agent, and start replication
Wait for replication to catch up.
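The region edit in the steps above might look roughly like this on an
Emperor/Firefly-era cluster.  This is a sketch, not a verified procedure;
check the commands against your version's docs before relying on them:

```shell
# On the secondary zone, before sending it any primary traffic.
# Stop replication first:
service radosgw-agent stop

# Pull the region config, enable metadata and data logging, push it back:
radosgw-admin region get > region.json
# ... edit region.json: set "log_meta": "true" and "log_data": "true" ...
radosgw-admin region set < region.json
radosgw-admin regionmap update

# Restart the gateway so it picks up the new region map:
service radosgw restart
```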

I think this will work, as long as you enable the data and metadata logging
in the secondary before you start writing to it.  Once the new secondary
has caught up on replication, you can repeat the process to promote it
back to master.




Worst Case:
Primary zone is down
Disable radosgw-agent in secondary zone
Update DNS/Load balancer to send primary traffic to secondary
Secondary is now the primary

When the old primary comes back online, drop all of its rgw pools
Rebuild the old primary as the new secondary
Setup radosgw-agent, and start replication
Wait a long time for replication to catch up.
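Dropping the old primary's pools might look like this.  The pool names below
are the defaults and may differ in your cluster; double-check with
`rados lspools` before deleting anything:

```shell
# Default rgw pool names; verify against your cluster first with
# `rados lspools | grep -E 'rgw|users|log|usage'`.
for pool in .rgw .rgw.root .rgw.control .rgw.gc .rgw.buckets \
            .rgw.buckets.index .log .usage .users .users.email \
            .users.swift .users.uid; do
    ceph osd pool delete "$pool" "$pool" --yes-i-really-really-mean-it
done
```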

That's pretty extreme, but it should work.



As far as limiting replication delay, I'm not aware of any tunables that
would do that.  It's an alert you should set up in your monitoring tool.  A
time-based delay is harder to measure; you'd have to set up a heartbeat file
that you could watch.  Byte-wise, you can monitor the replication backlog by
summing up the totals from `radosgw-admin bucket stats` in both zones.
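For the byte-wise check, a helper like this could sum up the bucket sizes in
each zone.  It's a sketch; the `--name` instance is hypothetical, so adjust
it to your gateway's client section:

```shell
# sum_size_kb: read `radosgw-admin bucket stats` JSON on stdin and print
# the total of all "size_kb_actual" fields, in KiB.
sum_size_kb() {
    grep '"size_kb_actual"' |
        awk -F: '{ gsub(/[ ,]/, "", $2); total += $2 } END { print total+0 }'
}

# Run in each zone and compare the totals, e.g.:
# radosgw-admin bucket stats --name client.radosgw.us-west-1 | sum_size_kb
```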

I do know that replication can get pretty far behind and still catch up.  I
deliberately imported into the primary faster than I could replicate.  In
the end, the primary was about 20 TiB ahead of the secondary.  It's still
catching up weeks later (I only have a 200 Mbps link).

There are some problems when the secondary gets that far behind.  If you're
using an older radosgw-agent, it might stop replicating buckets that haven't
seen any new write activity.  I wrote a bash script that uploads a 0-byte
file to every bucket every 10 minutes.  If you're using a newer
radosgw-agent, it works around that, but it doesn't persist its progress:
restarting radosgw-agent makes it start over.  Depending on how big your
replication backlog is, letting radosgw-agent run uninterrupted to
completion may or may not be a problem.
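A minimal version of such a touch-every-bucket script might look like this.
It's a sketch that assumes s3cmd is configured against the primary zone's
endpoint; run it from cron every 10 minutes:

```shell
#!/bin/bash
# Keep idle buckets replicating on older radosgw-agent by uploading a
# 0-byte file to each one.

# bucket_names: pull the bucket URI (last field) out of `s3cmd ls` output.
bucket_names() { awk '{ print $NF }'; }

touch_all_buckets() {
    : > /tmp/keepalive.touch           # create a 0-byte marker file
    for bucket in $(s3cmd ls | bucket_names); do
        s3cmd put /tmp/keepalive.touch "$bucket/keepalive.touch"
    done
}
```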




On Tue, Jun 17, 2014 at 3:37 PM, Fred Yang <frederic.y...@gmail.com> wrote:

> I have been looking for documents regarding DR procedure for Federated
> Gateway as well and not much luck. Can somebody from Inktank comment on
> that?
> In the event of site failure, what's the current procedure to switch
> master/secondary zone role? or Ceph currently does not have that capability
> yet? If that's the case, any roadmap to add that in future release?
>
> Also, for data sync from master to secondary, are there any parameter to
> control the maximum amount of data or time window that secondary zone can
> be lagging behind?
>
> Thanks,
> Fred
> On Jun 17, 2014 4:46 PM, "Craig Lewis" <cle...@centraldesktop.com> wrote:
>
>> Metadata replication is about keeping a global namespace in all zones.
>>  It will replicate all of your users and bucket names, but not the data
>> itself.  That way you don't end up with a bucket named "mybucket" in your
>> US and EU zones that are owned by different people.  It's up to you to
>> decide if this is something you want or not.  Metadata replication won't
>> protect against the primary zone going offline.
>>
>> Data replication will copy the metadata and data.  If the primary goes
>> offline, you'll be able to read everything that has replicated to the
>> secondary zone.  You should make sure you have enough bandwidth between the
>> zones (and that latency is low enough) that replication can keep up.
>>  If replication falls behind, anything not replicated will catch up when
>> the primary comes back up.
>>
>> I haven't found any docs on the process to promote a secondary zone to
>> primary.  Right now, it doesn't look like a good idea.  If the master goes
>> offline, you can read from the secondary while you get the master back
>> online.  Failover and failback are expensive (time- and bandwidth-wise), so
>> it would take a pretty big problem before promoting the secondary to
>> primary is a good idea.
>>
>>
>>
>> Regarding your FastCGI error, when I see that, it's because my RadosGW
>> daemon isn't running.  Check if it's running (`ps auxww | grep radosgw`).
>>  If it's not, try `start radosgw-all`, then restart apache.  If that
>> doesn't work, you might need some extra configs in ceph.conf.
>>
>>
>>
>> Wido den Hollander just posted some WSGI examples in a thread titled
>> "REST API and uWSGI?"  If you're still interested in getting WSGI to work,
>> check that thread.
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>