cbonami commented on issue #1193: Backup/Restore Mechanism for BookKeeper
URL: https://github.com/apache/bookkeeper/issues/1193#issuecomment-368582297

Ok, so thinking out loud here.

Let me first recap my situation, which actually led to the creation of this GitHub issue:

```
I run my 3 bookies in an AutoScalingGroup on AWS/EC2 that keeps the number of bookies to at least 3; when a bookie goes 'down' (codahale on 9001 does not return 'pong'), the bookie and its EBS-disks(!) are removed/deleted and replaced by new bookie/disk. Rereplicator service kicks in and repopulates the bookie.
So far so good, but what should I do to recover from the (rare) calamity that all bookies are killed and replaced (this is not unthinkable: connection with ZK-cluster lost will make the codahale metric fail and cause a meltdown/rotation of the *whole* BK cluster) ? Or a human error (somebody deletes the cloudformation stack by mistake; it already happened) ?
In such a disaster scenario, I presume, I need to have a backup somewhere, not?
```

I could also choose not to delete the EBS volume, snapshot it on a daily/hourly basis, and re-attach it to a NEW bookie server that AWS creates after it has killed the unresponsive bookie. Of course, this would require some ingenious scripting on my side (running during startup of the EC2 node, i.e. in the LaunchConfig), as AWS doesn't re-attach volumes automatically for AutoScaling groups. And I can only re-attach volumes that are in the same Availability Zone. Also, I guess I need to make absolutely sure that the new bookie node registers itself under exactly the _same_ name as the deleted one; otherwise I am re-attaching EBS volumes, but BK doesn't know what to do with them. Starts to feel complicated. And what if I scale up from 3 to 9 bookies: what are the implications of that? Or scale down?
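To make the re-attachment constraints above concrete, here is a minimal sketch of the selection step such a startup script would need: find a detached volume that lives in the new instance's Availability Zone and that belonged to the bookie identity being replaced. Everything in it is an assumption for illustration: the function name `pick_volume` and the `bookie-id` tag are hypothetical, and the dictionaries only mirror the shape of volume descriptions as returned by the EC2 API, as far as I understand it. It does not call AWS and does not attach anything.

```python
from typing import Optional


def pick_volume(volumes: list, instance_az: str, bookie_id: str) -> Optional[str]:
    """Return the ID of a detached volume for `bookie_id` in `instance_az`, or None.

    `volumes` is a list of dicts shaped like EC2 volume descriptions.
    The "bookie-id" tag is a hypothetical convention, not something
    BookKeeper or AWS defines.
    """
    for vol in volumes:
        # Flatten the [{"Key": ..., "Value": ...}] tag list into a dict.
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        if (
            vol["State"] == "available"                 # not attached to any instance
            and vol["AvailabilityZone"] == instance_az  # volumes cannot cross AZs
            and tags.get("bookie-id") == bookie_id      # same bookie identity
        ):
            return vol["VolumeId"]
    return None
```

If this returns `None`, the script would have to fall back to starting with an empty disk and letting re-replication repopulate the bookie, which is the behaviour I have today.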
Another approach would be to abandon the AutoScalingGroup and the convenience it brings me (scaling up can be done automatically and/or with a few clicks in the AWS dashboard), and rely on the auto-recovery feature for EC2 instances instead. When an instance goes down, AWS will recover it elsewhere in the same AZ and re-attach the old EBS volume to it. I presume the new instance will have exactly the same hostname (used for registering/identifying the bookie in ZK), and re-attaching the EBS volume will be automatic. But then I end up with a quite static configuration where, whenever I want to scale up or down, I need to execute a CloudFormation template, creating a new 'stack' for every bookie I want to deploy. Downscaling would mean deleting a stack. Quite clunky, no?

But even if I succeed in re-attaching volumes etc., I'm thinking ... isn't there still a way I can lose data, maybe due to a problem at the ZK level? Imagine somebody kills the ZK cluster 'by mistake' or deletes the directory for my namespace. If I recover from a ZK snapshot of an hour ago (while the data on the bookies has evolved since then), will I be in a corrupt state? Maybe Exhibitor can save me there ...

Hmm... maybe I should do an implicit backup: I use a separate bookie/EC2 instance, outside the AutoScalingGroup, that will auto-restore whenever it goes down. The only purpose of that bookie (1 or more) is to tail all logs in all namespaces and stream the data to files in S3 buckets (1 folder per namespace/log/whatever). In case of a catastrophe, I need to read ALL the data back and upload it to a fresh bookie cluster. But can I really do that while preserving timestamps and the other kinds of metadata that I want to restore to their initial state? Is this a viable strategy?
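As a rough illustration of the 'implicit backup' idea, the sketch below shows one way the backup files themselves could carry the metadata alongside each payload, so that at least nothing is lost on the way to S3. It is only a sketch under stated assumptions: the `BackupRecord` fields, the fixed binary header and the function names are all hypothetical, it writes to any file-like object rather than to S3, and it does not use the BookKeeper client. Whether a fresh cluster can actually be restored with the original ledger IDs and timestamps is exactly the open question, and this code does not answer it.

```python
import io
import struct
from typing import BinaryIO, Iterator, NamedTuple


class BackupRecord(NamedTuple):
    """One backed-up entry plus the metadata to keep (hypothetical layout)."""
    ledger_id: int
    entry_id: int
    timestamp_ms: int
    payload: bytes


# Big-endian header: ledger id, entry id, timestamp in ms, payload length.
_HEADER = struct.Struct(">qqqI")


def write_record(out: BinaryIO, rec: BackupRecord) -> None:
    """Append one length-prefixed record to the backup stream."""
    out.write(_HEADER.pack(rec.ledger_id, rec.entry_id, rec.timestamp_ms, len(rec.payload)))
    out.write(rec.payload)


def read_records(src: BinaryIO) -> Iterator[BackupRecord]:
    """Yield records in the order they were written; fail loudly on truncation."""
    while True:
        header = src.read(_HEADER.size)
        if not header:
            return  # clean end of stream
        if len(header) < _HEADER.size:
            raise ValueError("truncated record header")
        ledger_id, entry_id, timestamp_ms, length = _HEADER.unpack(header)
        payload = src.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        yield BackupRecord(ledger_id, entry_id, timestamp_ms, payload)
```

A round trip through an in-memory buffer (`io.BytesIO`) returns the same records in the same order, which is the minimum a restore job would need before it even starts writing to a new cluster.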
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
