Hmm, I wonder if the ec2_asg module needs to wait for the 
healthCheckGracePeriod to expire before checking instance health 
(when the health check type is ELB).  Even with a health check grace 
period of one second, there is a window in which an instance can be 
reported healthy before it really is.  Can you please open a bug on 
github.com/ansible/ansible-modules-core to track this?  

Thanks,

James

On Monday, November 24, 2014 12:44:21 PM UTC-5, Ben Whaley wrote:
>
> Hi James,
>
> Thanks for your reply.
>
> Interesting point about the HealthCheckGracePeriod option. I wasn't aware 
> of its role here. I am indeed using it, in fact according to the docs it is 
> a required option for ELB health checks. I had it set to 180, and I just 
> tried it with lower values of 10 and 1 second. In both cases the behavior 
> is the same: the autoscale group considers the instances healthy (because 
> of the grace period, even at the lower value) and as a result ansible moves 
> on before the instances are InService in the ELB. Even with the 
> HealthCheckGracePeriod at the lowest possible value of 1 second, a race 
> exists between the module's health check and the ELB grace period.
>
> I've worked around this for now with a script that does the following:
> - Find the instances in the ASG
> - Check the ELB to determine if they are healthy or not
> - Exit 1 if not, 0 if yes
>
> Then I use an ansible task with an "until" loop to check the return code. 
> The script is here:
>
> https://gist.github.com/anonymous/05e99828848ee565ed33
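> The gist above is anonymous, so for the record, a minimal sketch of that 
> check in boto 2 might look like the following.  The region, group, and ELB 
> names are placeholders, and the pure helper is separated from the AWS calls:

```python
def all_in_service(states):
    # True only when every ELB-reported instance state string is InService.
    return bool(states) and all(s == 'InService' for s in states)

def asg_instances_in_service(region, asg_name, elb_name):
    # boto is imported lazily so the helper above is usable without it.
    import boto.ec2.autoscale
    import boto.ec2.elb

    asg_conn = boto.ec2.autoscale.connect_to_region(region)
    elb_conn = boto.ec2.elb.connect_to_region(region)

    group = asg_conn.get_all_groups(names=[asg_name])[0]
    instance_ids = [i.instance_id for i in group.instances]

    # Ask the ELB itself -- not the ASG -- which instances are InService.
    health = elb_conn.describe_instance_health(elb_name, instances=instance_ids)
    return all_in_service([h.state for h in health])

# A wrapper script would finish with something like:
#     sys.exit(0 if asg_instances_in_service('us-east-1', 'my-asg', 'my-elb') else 1)
```

> An ansible task can then retry the script with until/retries/delay until 
> it exits 0.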
>
> Happy to work this into an ansible module if you think this is useful. Or 
> did I misunderstand the point about the health check grace period?
>
> Thanks,
> Ben
>
>
> On Monday, November 24, 2014 7:25:58 AM UTC-8, James Martin wrote:
>>
>> Ben,
>>
>> Thanks for the question.    Considering this: 
>> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html:
>>   "Auto Scaling marks an instance unhealthy if the calls to the Amazon EC2 
>> action DescribeInstanceStatus return any state other than running, the 
>> system status shows impaired, or the calls to Elastic Load Balancing action 
>> DescribeInstanceHealth returns OutOfService in the instance state field."
>>
>> For determining instance health status, we fetch the ASG object in boto 
>> and check the health_status attribute of each instance in the ASG, which 
>> is either "healthy" or "unhealthy".  Are you setting a health check grace 
>> period on the group?  See HealthCheckGracePeriod in 
>> http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html.
>> This option is configurable with the health_check_period setting in the 
>> ec2_asg module.  By default it is 500, which means the ASG reports every 
>> instance as healthy for the first 500 seconds, so the module can see an 
>> instance as healthy prematurely.
>>
>> - James
>>
>>
>>
>>
>>
>> On Saturday, November 22, 2014 5:39:28 PM UTC-5, Ben Whaley wrote:
>>>
>>> Hi all,
>>>
>>> Sorry for resurrecting an old thread, but wanted to mention my 
>>> experience thus far using ec2_asg & ec2_lc for code deploys.
>>>
>>> I'm more or less following the methods described in this helpful repo
>>>
>>> https://github.com/ansible/immutablish-deploys
>>>
>>> I believe the dual_asg role is accepted as the more reliable method for 
>>> deployments. If a deployment uses two ASGs, it's possible to just delete 
>>> the new ASG and everything goes back to normal. This is the "Netflix" 
>>> manner of releasing updates.
>>>
>>> The thing I'm finding though is that instances become "viable" well 
>>> before they're actually InService in the ELB. From the ec2_asg code and by 
>>> running ansible in verbose mode it's clear that ansible considers an 
>>> instance viable once AWS indicates that instances are Healthy and 
>>> InService. Checking via the AWS CLI tool, I can see that the ASG shows 
>>> instances as Healthy and InService, but the ELB shows OutOfService. 
>>>
>>> The AWS docs are clear about the behavior of autoscale instances with 
>>> health check type ELB: "For each call, if the Elastic Load Balancing action 
>>> returns any state other than InService, the instance is marked as 
>>> unhealthy." But this is not actually the case. 
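>>> A quick way to see the disagreement is to pull both views and diff them. 
>>> The comparator below is just an illustration (the surrounding AWS calls 
>>> are omitted), assuming plain {instance_id: state} dicts for each side:

```python
def health_disagreement(asg_states, elb_states):
    # asg_states: {instance_id: state as the ASG reports it, e.g. "InService"}
    # elb_states: {instance_id: state as the ELB reports it}
    # Returns the ids the ASG already calls InService but the ELB does not.
    return sorted(i for i, s in asg_states.items()
                  if s == 'InService' and elb_states.get(i) != 'InService')
```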
>>>
>>> Has anyone else encountered this? Any suggested workarounds or fixes?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Thursday, September 11, 2014 12:54:25 PM UTC-7, Scott Anderson wrote:
>>>>
>>>> On Sep 11, 2014, at 3:26 PM, James Martin <[email protected]> wrote:
>>>>
>>>> I think we’re probably going to move to a system that uses a tier of 
>>>>> proxies and two ELBs. That way we can update the idle ELB, change out the 
>>>>> AMIs, and bring the updated ELB up behind an alternate domain for the 
>>>>> blue-green testing. Then when everything checks out, switch the proxies 
>>>>> to 
>>>>> the updated ELB and take down the remaining, now idle ELB.
>>>>>
>>>>>
>>>> Not following this exactly -- what's your tier of proxies?  You have a 
>>>> group of proxies (haproxy, nginx) behind a load balancer that point to 
>>>> your 
>>>> application?
>>>>
>>>>
>>>> Yes, nginx or some other HA-ish thing. If it’s nginx then you can 
>>>> maintain a brochure site even if something horrible happens to the 
>>>> application.
>>>>
>>>>  
>>>>
>>>>> Amazon would suggest using Route53 to point to the new ELB, but 
>>>>> there’s too great a chance of faulty DNS caching breaking a switch to a 
>>>>> new 
>>>>> ELB. Plus there’s a 60s TTL to start with regardless, even in the absence 
>>>>> of caching.
>>>>>
>>>>
>>>> Quite right.  There are some interesting things you can do with tools 
>>>> running on the hosts to redirect traffic from the blue hosts to the 
>>>> green LB, socat being one.  Once you see no more traffic reaching blue, 
>>>> you can terminate it.
>>>>
>>>>
>>>> That’s an interesting idea, but it fails if people are behind a caching 
>>>> DNS and they visit after you’ve terminated the blue traffic but before 
>>>> their caching DNS lets go of the record.
>>>>
>>>> You're right, I did miss that.  By checking the AMI, you're only 
>>>> updating the instances if the AMI changes.  If you are checking the launch 
>>>> config, you are updating the instances if any component of the launch 
>>>> config has changed -- AMI, instance type, address type, etc.
>>>>
>>>>
>>>> That’s true, but if I’m changing instance types I’ll generally just 
>>>> cycle_all. Because of the connection draining and the parallelism of 
>>>> instance creation, it’s just as quick to do all of them as only the ones 
>>>> that need changing. That said, it’s an obvious optimization for sure.
>>>>
>>>>
>>>> Using the ASG to do the provisioning might be preferable if it’s 
>>>>> reliable. At first I went that route, but I was having problems with the 
>>>>> ASG’s provisioning being non-deterministic. Manually creating the 
>>>>> instances 
>>>>> seems to ensure that things happen in a particular order and with 
>>>>> predictable speed. As mentioned, the manual method definitely works every 
>>>>> time, although I need to add some more timeout and error checking (like 
>>>>> what happens if I ask for 3 new instances and only get 2).
>>>>>
>>>>>
>>>> I didn't have any issues with the ASG doing the provisioning, but I 
>>>> would say nothing is predictable with AWS :).  
>>>>
>>>>
>>>> Very true. Over the past few months I’ve had several working processes 
>>>> just fail with no warning. The most recent is AWS sometimes refusing to 
>>>> return the current list of AMIs. Prior to that it was the Available status 
>>>> on an AMI not really meaning available. Now I check the list of returned 
>>>> AMIs in a loop until the one I’m looking for shows up, Available status 
>>>> notwithstanding. Very frustrating. Things could be worse, however: the API 
>>>> could be run by Facebook...
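>>>> The loop described above might look roughly like this (boto 2; the 
>>>> timeout and interval defaults are made up, and the exception handling 
>>>> assumes a not-yet-visible AMI surfaces as an API error such as 
>>>> InvalidAMIID.NotFound):

```python
import time

def wait_for_ami(ec2_connection, image_id, timeout=600, interval=15):
    # Poll get_all_images until the AMI actually appears with state
    # "available", instead of trusting the first "available" response.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            images = ec2_connection.get_all_images(image_ids=[image_id])
        except Exception:  # e.g. InvalidAMIID.NotFound while still registering
            images = []
        if any(im.id == image_id and im.state == 'available' for im in images):
            return True
        time.sleep(interval)
    return False
```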
>>>>
>>>>
>>>>> I have a separate task that cleans up the old AMIs and LCs, 
>>>>> incidentally. I keep the most recent around as a backup for quick 
>>>>> rollbacks.
>>>>>
>>>>
>>>> That's cool, care to share?
>>>>  
>>>>
>>>>
>>>> I think I’ve posted it before, but here’s the important bit. After 
>>>> deleting everything but the most recent backup AMI (determined by naming 
>>>> convention or tags), delete any LC that doesn’t have an associated AMI:
>>>>
>>>> import boto.exception  # needed for the NotFound handling below
>>>>
>>>> def delete_launch_configs(asg_connection, ec2_connection, module):
>>>>     changed = False
>>>>
>>>>     launch_configs = asg_connection.get_all_launch_configurations()
>>>>
>>>>     for config in launch_configs:
>>>>         image_id = config.image_id
>>>>         # A deregistered AMI makes boto raise EC2ResponseError
>>>>         # (InvalidAMIID.NotFound) rather than return an empty list.
>>>>         try:
>>>>             images = ec2_connection.get_all_images(image_ids=[image_id])
>>>>         except boto.exception.EC2ResponseError:
>>>>             images = []
>>>>
>>>>         # No backing AMI left: the launch config is orphaned, delete it.
>>>>         if not images:
>>>>             config.delete()
>>>>             changed = True
>>>>
>>>>     module.exit_json(changed=changed)
>>>>
>>>>
>>>> -scott
>>>>
>>>>
>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/ansible-project/269ec5ca-4371-456f-92bd-7e2b370e7516%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
