cbonami commented on issue #1193: Backup/Restore Mechanism for BookKeeper
URL: https://github.com/apache/bookkeeper/issues/1193#issuecomment-368582297
 
 
   Ok, so thinking out loud here.
   Let me first recap my situation, which actually led to the creation 
of this GitHub issue:
   
   _I run my 3 bookies in an AutoScalingGroup on AWS/EC2 that keeps the number 
of bookies at 3 or more; when a bookie goes 'down' (codahale on 9001 does not 
return 'pong'), the bookie and its EBS disks(!) are deleted and replaced by a 
new bookie/disk. The rereplicator service kicks in and repopulates the new 
bookie. So far so good, but what should I do to recover from the (rare) 
calamity in which all bookies are killed and replaced? This is not unthinkable: 
losing the connection to the ZK cluster will make the codahale metric fail and 
cause a meltdown/rotation of the *whole* BK cluster. Or from a human error 
(somebody deletes the CloudFormation stack by mistake; it has already 
happened)? In such a disaster scenario I presume I need to have a backup 
somewhere, right?_
   
   I could also choose not to delete the EBS volume, snapshot it on a 
daily/hourly basis, and re-attach it to a new bookie server that AWS creates 
after it has killed the unresponsive bookie. Of course, this would require 
some ingenious scripting on my side (running during startup of the EC2 node, 
i.e. in the LaunchConfig), as AWS doesn't reattach volumes automatically for 
AutoScaling groups. And I can only re-attach volumes that are in the same 
Availability Zone. Also, I guess I need to make absolutely sure that the new 
bookie node registers itself under exactly the _same_ name as the deleted one; 
otherwise I am reattaching EBS volumes, but BK doesn't know what to do with 
them. 
   Starts to feel complicated. And what if I scale up from 3 to 9 bookies: 
what are the implications of that? Or scale down?
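   
   A minimal sketch of the volume-selection step such a startup script would 
need, under stated assumptions: the `Volume` class, the `bookie_id` field (a 
hypothetical tag written when the volume is first created) and `pick_volume` 
are illustrative names, not part of AWS or BookKeeper. The list of volumes is 
passed in as plain data so the logic can be tried without any AWS calls.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Volume:
    """Hypothetical view of one EBS data volume (e.g. built from a volume listing)."""
    volume_id: str
    availability_zone: str
    state: str       # "available" is taken here to mean detached and free to attach
    bookie_id: str   # assumed tag holding the identity of the bookie that wrote the data


def pick_volume(volumes: Sequence[Volume], my_az: str) -> Optional[Volume]:
    """Return a detached volume in this instance's AZ, or None if there is none."""
    candidates = [
        v for v in volumes
        if v.state == "available" and v.availability_zone == my_az
    ]
    # Sort so that every instance sees the same ordering; a real script would
    # still have to cope with two new instances racing for the same volume.
    candidates.sort(key=lambda v: v.volume_id)
    return candidates[0] if candidates else None
```

   If a volume is found, the new node would attach it and start the bookie 
under the recorded `bookie_id`; if none is found (for example after scaling 
up), it would start as a brand-new bookie with an empty disk. How the bookie is 
made to register under a given identity is left open here.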
   
   Another approach would be to abandon the AutoScalingGroup approach and the 
convenience it brings me (scaling up can be done automatically and/or with a 
few clicks in the AWS dashboard), and rely on the auto-recovery feature for 
EC2 instances instead. When an instance goes down, AWS will recover it 
elsewhere in the same AZ and reattach the old EBS volume to it. I presume the 
new instance will have exactly the same hostname (used for 
registering/identifying the bookie in ZK), and reattaching the EBS volume will 
be automatic. But then I end up with a quite static configuration where, 
whenever I want to scale up or down, I need to execute a CloudFormation 
template, creating a new 'stack' for every bookie I want to deploy. 
Downscaling would mean deleting a stack. Quite clunky, no?
   
   But even if I succeed in reattaching volumes etc., I'm thinking: isn't 
there still a way that I can lose data, maybe due to a problem at the ZK 
level? Imagine somebody kills the ZK cluster 'by mistake' or deletes the 
directory for my namespace. If I recover from a ZK snapshot of an hour ago 
(while the data on the bookies has kept evolving), will I end up in a corrupt 
state? Maybe Exhibitor can save me there ...
   
   Hmm... maybe I should do an implicit backup: I use a separate 
bookie/EC2 instance, outside the AutoScalingGroup, that is auto-recovered 
whenever it goes down. The only purpose of this bookie (1 or more) is to tail 
all logs in all namespaces and stream the data to files in S3 buckets (1 
folder per namespace/log/whatever). In case of a catastrophe, I need to read 
ALL the data back and upload it to a fresh bookie cluster. But can I really 
do that while preserving timestamps and the other metadata that I want to 
restore to its initial state? Is this a viable strategy?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
