Opened issue 383. Thanks again. https://github.com/ansible/ansible-modules-core/issues/383
- Ben

On Monday, November 24, 2014 9:49:45 AM UTC-8, James Martin wrote:
>
> Hmm.. I wonder if we need to have the ec2_asg module wait for the
> HealthCheckGracePeriod to expire before checking the instance health
> (assuming that it is an ELB health check type). Even with a health
> check grace period of one, there is a chance that the instance can
> become healthy in that time. Can you please open a bug on
> github.com/ansible/ansible-modules-core to track this?
>
> Thanks,
>
> James
>
> On Monday, November 24, 2014 12:44:21 PM UTC-5, Ben Whaley wrote:
>>
>> Hi James,
>>
>> Thanks for your reply.
>>
>> Interesting point about the HealthCheckGracePeriod option. I wasn't
>> aware of its role here. I am indeed using it; in fact, according to
>> the docs it is a required option for ELB health checks. I had it set
>> to 180, and I just tried it with lower values of 10 and 1 second. In
>> both cases the behavior is the same: the autoscale group considers
>> the instances healthy (because of the grace period, even at the lower
>> value), and as a result ansible moves on before the instances are
>> InService in the ELB. Even with the HealthCheckGracePeriod at the
>> lowest possible value of 1 second, a race exists between the module's
>> health check and the ELB grace period.
>>
>> I've worked around this for now with a script that does the following:
>> - Find the instances in the ASG
>> - Check the ELB to determine if they are healthy or not
>> - Exit 1 if not, 0 if yes
>>
>> Then I use an ansible task with an "until" loop to check the return
>> code. The script is here:
>>
>> https://gist.github.com/anonymous/05e99828848ee565ed33
>>
>> Happy to work this into an ansible module if you think this is
>> useful. Or did I misunderstand the point about the health check grace
>> period?
>>
>> Thanks,
>> Ben
>>
>> On Monday, November 24, 2014 7:25:58 AM UTC-8, James Martin wrote:
>>>
>>> Ben,
>>>
>>> Thanks for the question.
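A script along the lines Ben describes (the gist itself isn't reproduced in this thread, so the names and structure below are assumptions, not the actual gist contents) could look roughly like this with boto 2. The pure pass/fail check is split out from the AWS calls so it can be exercised without credentials:

```python
# Hypothetical reconstruction of the workaround: ask the ASG for its
# instance IDs, then ask the ELB (not the ASG) whether each one is
# InService.

def all_in_service(states):
    """True only when there is at least one state and all are InService."""
    return bool(states) and all(s == "InService" for s in states)

def asg_instances_in_service(asg_name, elb_name, region):
    """Query AWS with boto 2 and return True when the ELB reports every
    instance of the ASG as InService."""
    import boto.ec2.autoscale
    import boto.ec2.elb

    asg_conn = boto.ec2.autoscale.connect_to_region(region)
    group = asg_conn.get_all_groups(names=[asg_name])[0]
    instance_ids = [i.instance_id for i in group.instances]

    elb_conn = boto.ec2.elb.connect_to_region(region)
    health = elb_conn.describe_instance_health(elb_name,
                                               instances=instance_ids)
    return all_in_service([h.state for h in health])
```

A wrapper can then exit 0 when this returns True and 1 otherwise, and an Ansible task with an "until" loop can retry on the return code, which is effectively the workaround Ben describes.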
>>> Considering this:
>>> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html
>>>
>>> "Auto Scaling marks an instance unhealthy if the calls to the Amazon
>>> EC2 action DescribeInstanceStatus return any state other than
>>> running, the system status shows impaired, or the calls to Elastic
>>> Load Balancing action DescribeInstanceHealth returns OutOfService in
>>> the instance state field."
>>>
>>> For determining the instance health status, we are fetching an ASG
>>> object in boto and checking the health_status attribute for each
>>> instance in the ASG, which is either "healthy" or "unhealthy". Are
>>> you using an instance grace period option for the ELB? See
>>> HealthCheckGracePeriod at
>>> http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html.
>>> This option is configurable with the health_check_period setting
>>> found in the ec2_asg module. By default it is 500, and this can
>>> prematurely report an instance as healthy, since any instance is
>>> considered healthy for those 500 seconds.
>>>
>>> - James
>>>
>>> On Saturday, November 22, 2014 5:39:28 PM UTC-5, Ben Whaley wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Sorry for resurrecting an old thread, but I wanted to mention my
>>>> experience thus far using ec2_asg & ec2_lc for code deploys.
>>>>
>>>> I'm more or less following the methods described in this helpful
>>>> repo:
>>>>
>>>> https://github.com/ansible/immutablish-deploys
>>>>
>>>> I believe the dual_asg role is accepted as the more reliable method
>>>> for deployments. If a deployment uses two ASGs, it's possible to
>>>> just delete the new ASG and everything goes back to normal. This is
>>>> the "Netflix" manner of releasing updates.
>>>>
>>>> The thing I'm finding, though, is that instances become "viable"
>>>> well before they're actually InService in the ELB.
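The viability test at issue here, as James describes it earlier in the thread, only consults the ASG's own view of each instance. A simplified sketch of that logic (not the module's actual code; the data shape is an assumption mirroring the attributes boto exposes on an AutoScalingGroup's instances):

```python
def viable_count(instances):
    """Count instances the ASG itself reports as healthy and in service.

    `instances` is assumed to be a list of (health_status,
    lifecycle_state) pairs. Note that the ELB's DescribeInstanceHealth
    is never consulted, so an instance can count as viable here while
    the ELB still reports it OutOfService.
    """
    return sum(
        1
        for health_status, lifecycle_state in instances
        if health_status.lower() == "healthy"
        and lifecycle_state == "InService"
    )
```

During the HealthCheckGracePeriod the ASG reports instances as Healthy regardless of what the ELB sees, which is the race being described.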
>>>> From the ec2_asg code and by running ansible in verbose mode, it's
>>>> clear that ansible considers an instance viable once AWS indicates
>>>> that instances are Healthy and InService. Checking via the AWS CLI
>>>> tool, I can see that the ASG shows instances as Healthy and
>>>> InService, but the ELB shows OutOfService.
>>>>
>>>> The AWS docs are clear about the behavior of autoscale instances
>>>> with health check type ELB: "For each call, if the Elastic Load
>>>> Balancing action returns any state other than InService, the
>>>> instance is marked as unhealthy." But this is not actually the case.
>>>>
>>>> Has anyone else encountered this? Any suggested workarounds or
>>>> fixes?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On Thursday, September 11, 2014 12:54:25 PM UTC-7, Scott Anderson
>>>> wrote:
>>>>>
>>>>> On Sep 11, 2014, at 3:26 PM, James Martin <[email protected]> wrote:
>>>>>
>>>>>> I think we're probably going to move to a system that uses a tier
>>>>>> of proxies and two ELBs. That way we can update the idle ELB,
>>>>>> change out the AMIs, and bring the updated ELB up behind an
>>>>>> alternate domain for the blue-green testing. Then when everything
>>>>>> checks out, switch the proxies to the updated ELB and take down
>>>>>> the remaining, now idle ELB.
>>>>>
>>>>> Not following this exactly -- what's your tier of proxies? You have
>>>>> a group of proxies (haproxy, nginx) behind a load balancer that
>>>>> point to your application?
>>>>>
>>>>> Yes, nginx or some other HA-ish thing. If it's nginx then you can
>>>>> maintain a brochure site even if something horrible happens to the
>>>>> application.
>>>>>
>>>>>> Amazon would suggest using Route53 to point to the new ELB, but
>>>>>> there's too great a chance of faulty DNS caching breaking a switch
>>>>>> to a new ELB. Plus there's a 60s TTL to start with regardless,
>>>>>> even in the absence of caching.
>>>>>
>>>>> Quite right.
>>>>> There are some interesting things you can do with tools you could
>>>>> run on the hosts that would redirect traffic from blue hosts to the
>>>>> green LB, socat being one. After you notice no more traffic coming
>>>>> to blue, you can terminate it.
>>>>>
>>>>> That's an interesting idea, but it fails if people are behind a
>>>>> caching DNS and they visit after you've terminated the blue traffic
>>>>> but before their caching DNS lets go of the record.
>>>>>
>>>>> You're right, I did miss that. By checking the AMI, you're only
>>>>> updating the instance if the AMI changes. If you are checking the
>>>>> launch config, you are updating the instances if any component of
>>>>> the launch config has changed -- AMI, instance type, address type,
>>>>> etc.
>>>>>
>>>>> That's true, but if I'm changing instance types I'll generally just
>>>>> cycle_all. Because of the connection draining and parallelism of
>>>>> the instance creation, it's just as quick to do all of them instead
>>>>> of the ones that need changing. That said, it's an obvious
>>>>> optimization for sure.
>>>>>
>>>>>> Using the ASG to do the provisioning might be preferable if it's
>>>>>> reliable. At first I went that route, but I was having problems
>>>>>> with the ASG's provisioning being non-deterministic. Manually
>>>>>> creating the instances seems to ensure that things happen in a
>>>>>> particular order and with predictable speed. As mentioned, the
>>>>>> manual method definitely works every time, although I need to add
>>>>>> some more timeout and error checking (like what happens if I ask
>>>>>> for 3 new instances and only get 2).
>>>>>
>>>>> I didn't have any issues with the ASG doing the provisioning, but I
>>>>> would say nothing is predictable with AWS :).
>>>>>
>>>>> Very true. Over the past few months I've had several working
>>>>> processes just fail with no warning. The most recent is AWS
>>>>> sometimes refusing to return the current list of AMIs.
>>>>> Prior to that it was the Available status on an AMI not really
>>>>> meaning available. Now I check the list of returned AMIs in a loop
>>>>> until the one I'm looking for shows up, Available status
>>>>> notwithstanding. Very frustrating. Things could be worse, however:
>>>>> the API could be run by Facebook...
>>>>>
>>>>>> I have a separate task that cleans up the old AMIs and LCs,
>>>>>> incidentally. I keep the most recent around as a backup for quick
>>>>>> rollbacks.
>>>>>
>>>>> That's cool, care to share?
>>>>>
>>>>> I think I've posted it before, but here's the important bit. After
>>>>> deleting everything but the oldest backup AMI (determined by naming
>>>>> convention or tags), delete any LC that doesn't have an associated
>>>>> AMI:
>>>>>
>>>>> def delete_launch_configs(asg_connection, ec2_connection, module):
>>>>>     changed = False
>>>>>
>>>>>     launch_configs = asg_connection.get_all_launch_configurations()
>>>>>
>>>>>     for config in launch_configs:
>>>>>         image_id = config.image_id
>>>>>         images = ec2_connection.get_all_images(image_ids=[image_id])
>>>>>
>>>>>         if not images:
>>>>>             config.delete()
>>>>>             changed = True
>>>>>
>>>>>     module.exit_json(changed=changed)
>>>>>
>>>>> -scott

To view this discussion on the web visit
https://groups.google.com/d/msgid/ansible-project/4ff48246-663c-4dcf-aafe-78d6fe21314c%40googlegroups.com.
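Scott's loop that waits for an AMI to really show up could be sketched like this (a sketch only: the function names, timeouts, and the broad exception handling are assumptions, not his actual task):

```python
import time

def ami_is_ready(images, ami_id):
    """True once the AMI both appears in the listing and reports
    'available'."""
    return any(i.id == ami_id and i.state == "available" for i in images)

def wait_for_ami(ec2_connection, ami_id, timeout=600, interval=15):
    """Poll get_all_images until the AMI genuinely appears.

    The 'available' state alone proved unreliable in Scott's experience,
    hence also requiring the image to appear in the returned list at all.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            images = ec2_connection.get_all_images(image_ids=[ami_id])
        except Exception:
            # boto raises EC2ResponseError for an ID AWS doesn't know
            # about yet; treat that the same as "not listed".
            images = []
        if ami_is_ready(images, ami_id):
            return True
        time.sleep(interval)
    return False
```

The caller would fail the deploy if this returns False rather than trusting the first DescribeImages response.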
