Opened issue 383. Thanks again. https://github.com/ansible/ansible-modules-core/issues/383
- Ben

On Monday, November 24, 2014 9:49:45 AM UTC-8, James Martin wrote:
>
> Hmm.. I wonder if we need to have the ec2_asg module wait for the
> HealthCheckGracePeriod to expire before checking the instance health
> (assuming that it is an ELB health check type). Even with a health
> check grace period of one, there is a chance that the instance can
> become healthy in that time. Can you please open a bug on
> github.com/ansible/ansible-modules-core to track this?
>
> Thanks,
>
> James
>
> On Monday, November 24, 2014 12:44:21 PM UTC-5, Ben Whaley wrote:
>>
>> Hi James,
>>
>> Thanks for your reply.
>>
>> Interesting point about the HealthCheckGracePeriod option. I wasn't
>> aware of its role here. I am indeed using it; in fact, according to
>> the docs it is a required option for ELB health checks. I had it set
>> to 180, and I just tried it with lower values of 10 and 1 second. In
>> both cases the behavior is the same: the autoscale group considers
>> the instances healthy (because of the grace period, even at the lower
>> value), and as a result ansible moves on before the instances are
>> InService in the ELB. Even with the HealthCheckGracePeriod at the
>> lowest possible value of 1 second, a race exists between the module's
>> health check and the ELB grace period.
>>
>> I've worked around this for now with a script that does the following:
>> - Find the instances in the ASG
>> - Check the ELB to determine if they are healthy or not
>> - Exit 1 if not, 0 if yes
>>
>> Then I use an ansible task with an "until" loop to check the return
>> code. The script is here:
>>
>> https://gist.github.com/anonymous/05e99828848ee565ed33
>>
>> Happy to work this into an ansible module if you think this is
>> useful. Or did I misunderstand the point about the health check grace
>> period?
>>
>> Thanks,
>> Ben
>>
>> On Monday, November 24, 2014 7:25:58 AM UTC-8, James Martin wrote:
>>>
>>> Ben,
>>>
>>> Thanks for the question.
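A script along the lines Ben describes (the gist itself isn't reproduced in this thread, so the names and structure below are assumptions, not the actual gist contents) could look roughly like this with boto 2. The pure pass/fail check is split out from the AWS calls so it can be exercised without credentials:

```python
# Hypothetical reconstruction of the workaround: ask the ASG for its
# instance IDs, then ask the ELB (not the ASG) whether each one is
# InService.

def all_in_service(states):
    """True only when there is at least one state and all are InService."""
    return bool(states) and all(s == "InService" for s in states)

def asg_instances_in_service(asg_name, elb_name, region):
    """Query AWS with boto 2 and return True when the ELB reports every
    instance of the ASG as InService."""
    import boto.ec2.autoscale
    import boto.ec2.elb

    asg_conn = boto.ec2.autoscale.connect_to_region(region)
    group = asg_conn.get_all_groups(names=[asg_name])[0]
    instance_ids = [i.instance_id for i in group.instances]

    elb_conn = boto.ec2.elb.connect_to_region(region)
    health = elb_conn.describe_instance_health(elb_name,
                                               instances=instance_ids)
    return all_in_service([h.state for h in health])
```

A wrapper can then exit 0 when this returns True and 1 otherwise, and an Ansible task with an "until" loop can retry on the return code, which is effectively the workaround Ben describes.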
>>> Considering this:
>>> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html
>>>
>>> "Auto Scaling marks an instance unhealthy if the calls to the Amazon
>>> EC2 action DescribeInstanceStatus return any state other than
>>> running, the system status shows impaired, or the calls to Elastic
>>> Load Balancing action DescribeInstanceHealth returns OutOfService in
>>> the instance state field."
>>>
>>> For determining the instance health status, we are fetching an ASG
>>> object in boto and checking the health_status attribute for each
>>> instance in the ASG, which is either "healthy" or "unhealthy". Are
>>> you using an instance grace period option for the ELB? See
>>> HealthCheckGracePeriod at
>>> http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html.
>>> This option is configurable with the health_check_period setting
>>> found in the ec2_asg module. By default it is 500, and this can
>>> prematurely report an instance as healthy, since any instance is
>>> considered healthy for those 500 seconds.
>>>
>>> - James
>>>
>>> On Saturday, November 22, 2014 5:39:28 PM UTC-5, Ben Whaley wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Sorry for resurrecting an old thread, but I wanted to mention my
>>>> experience thus far using ec2_asg & ec2_lc for code deploys.
>>>>
>>>> I'm more or less following the methods described in this helpful
>>>> repo:
>>>>
>>>> https://github.com/ansible/immutablish-deploys
>>>>
>>>> I believe the dual_asg role is accepted as the more reliable method
>>>> for deployments. If a deployment uses two ASGs, it's possible to
>>>> just delete the new ASG and everything goes back to normal. This is
>>>> the "Netflix" manner of releasing updates.
>>>>
>>>> The thing I'm finding, though, is that instances become "viable"
>>>> well before they're actually InService in the ELB.
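The viability test at issue here, as James describes it earlier in the thread, only consults the ASG's own view of each instance. A simplified sketch of that logic (not the module's actual code; the data shape is an assumption mirroring the attributes boto exposes on an AutoScalingGroup's instances):

```python
def viable_count(instances):
    """Count instances the ASG itself reports as healthy and in service.

    `instances` is assumed to be a list of (health_status,
    lifecycle_state) pairs. Note that the ELB's DescribeInstanceHealth
    is never consulted, so an instance can count as viable here while
    the ELB still reports it OutOfService.
    """
    return sum(
        1
        for health_status, lifecycle_state in instances
        if health_status.lower() == "healthy"
        and lifecycle_state == "InService"
    )
```

During the HealthCheckGracePeriod the ASG reports instances as Healthy regardless of what the ELB sees, which is the race being described.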
>>>> From the ec2_asg code and by running ansible in verbose mode, it's
>>>> clear that ansible considers an instance viable once AWS indicates
>>>> that instances are Healthy and InService. Checking via the AWS CLI
>>>> tool, I can see that the ASG shows instances as Healthy and
>>>> InService, but the ELB shows OutOfService.
>>>>
>>>> The AWS docs are clear about the behavior of autoscale instances
>>>> with health check type ELB: "For each call, if the Elastic Load
>>>> Balancing action returns any state other than InService, the
>>>> instance is marked as unhealthy." But this is not actually the case.
>>>>
>>>> Has anyone else encountered this? Any suggested workarounds or
>>>> fixes?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Thursday, September 11, 2014 12:54:25 PM UTC-7, Scott Anderson
>>>> wrote:
>>>>>
>>>>> On Sep 11, 2014, at 3:26 PM, James Martin <[email protected]> wrote:
>>>>>
>>>>>> I think we're probably going to move to a system that uses a tier
>>>>>> of proxies and two ELBs. That way we can update the idle ELB,
>>>>>> change out the AMIs, and bring the updated ELB up behind an
>>>>>> alternate domain for the blue-green testing. Then when everything
>>>>>> checks out, switch the proxies to the updated ELB and take down
>>>>>> the remaining, now idle ELB.
>>>>>
>>>>> Not following this exactly -- what's your tier of proxies? You have
>>>>> a group of proxies (haproxy, nginx) behind a load balancer that
>>>>> point to your application?
>>>>>
>>>>> Yes, nginx or some other HA-ish thing. If it's nginx then you can
>>>>> maintain a brochure site even if something horrible happens to the
>>>>> application.
>>>>>
>>>>>> Amazon would suggest using Route53 to point to the new ELB, but
>>>>>> there's too great a chance of faulty DNS caching breaking a switch
>>>>>> to a new ELB. Plus there's a 60s TTL to start with regardless,
>>>>>> even in the absence of caching.
>>>>>
>>>>> Quite right.
>>>>> There are some interesting things you can do with tools you could
>>>>> run on the hosts that would redirect traffic from blue hosts to the
>>>>> green LB, socat being one. After you notice no more traffic coming
>>>>> to blue, you can terminate it.
>>>>>
>>>>> That's an interesting idea, but it fails if people are behind a
>>>>> caching DNS and they visit after you've terminated the blue traffic
>>>>> but before their caching DNS lets go of the record.
>>>>>
>>>>> You're right, I did miss that. By checking the AMI, you're only
>>>>> updating the instance if the AMI changes. If you are checking the
>>>>> launch config, you are updating the instances if any component of
>>>>> the launch config has changed -- AMI, instance type, address type,
>>>>> etc.
>>>>>
>>>>> That's true, but if I'm changing instance types I'll generally just
>>>>> cycle_all. Because of the connection draining and parallelism of
>>>>> the instance creation, it's just as quick to do all of them instead
>>>>> of the ones that need changing. That said, it's an obvious
>>>>> optimization for sure.
>>>>>
>>>>>> Using the ASG to do the provisioning might be preferable if it's
>>>>>> reliable. At first I went that route, but I was having problems
>>>>>> with the ASG's provisioning being non-deterministic. Manually
>>>>>> creating the instances seems to ensure that things happen in a
>>>>>> particular order and with predictable speed. As mentioned, the
>>>>>> manual method definitely works every time, although I need to add
>>>>>> some more timeout and error checking (like what happens if I ask
>>>>>> for 3 new instances and only get 2).
>>>>>
>>>>> I didn't have any issues with the ASG doing the provisioning, but I
>>>>> would say nothing is predictable with AWS :).
>>>>>
>>>>> Very true. Over the past few months I've had several working
>>>>> processes just fail with no warning. The most recent is AWS
>>>>> sometimes refusing to return the current list of AMIs.
>>>>> Prior to that it was the Available status on an AMI not really
>>>>> meaning available. Now I check the list of returned AMIs in a loop
>>>>> until the one I'm looking for shows up, Available status
>>>>> notwithstanding. Very frustrating. Things could be worse, however:
>>>>> the API could be run by Facebook...
>>>>>
>>>>>> I have a separate task that cleans up the old AMIs and LCs,
>>>>>> incidentally. I keep the most recent around as a backup for quick
>>>>>> rollbacks.
>>>>>
>>>>> That's cool, care to share?
>>>>>
>>>>> I think I've posted it before, but here's the important bit. After
>>>>> deleting everything but the oldest backup AMI (determined by naming
>>>>> convention or tags), delete any LC that doesn't have an associated
>>>>> AMI:
>>>>>
>>>>> def delete_launch_configs(asg_connection, ec2_connection, module):
>>>>>     changed = False
>>>>>
>>>>>     launch_configs = asg_connection.get_all_launch_configurations()
>>>>>
>>>>>     for config in launch_configs:
>>>>>         image_id = config.image_id
>>>>>         images = ec2_connection.get_all_images(image_ids=[image_id])
>>>>>
>>>>>         if not images:
>>>>>             config.delete()
>>>>>             changed = True
>>>>>
>>>>>     module.exit_json(changed=changed)
>>>>>
>>>>> -scott

To view this discussion on the web visit
https://groups.google.com/d/msgid/ansible-project/4ff48246-663c-4dcf-aafe-78d6fe21314c%40googlegroups.com.
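Scott's loop that waits for an AMI to really show up could be sketched like this (a sketch only: the function names, timeouts, and the broad exception handling are assumptions, not his actual task):

```python
import time

def ami_is_ready(images, ami_id):
    """True once the AMI both appears in the listing and reports
    'available'."""
    return any(i.id == ami_id and i.state == "available" for i in images)

def wait_for_ami(ec2_connection, ami_id, timeout=600, interval=15):
    """Poll get_all_images until the AMI genuinely appears.

    The 'available' state alone proved unreliable in Scott's experience,
    hence also requiring the image to appear in the returned list at all.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            images = ec2_connection.get_all_images(image_ids=[ami_id])
        except Exception:
            # boto raises EC2ResponseError for an ID AWS doesn't know
            # about yet; treat that the same as "not listed".
            images = []
        if ami_is_ready(images, ami_id):
            return True
        time.sleep(interval)
    return False
```

The caller would fail the deploy if this returns False rather than trusting the first DescribeImages response.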
