Hi James,

Thanks for your reply.

Interesting point about the HealthCheckGracePeriod option. I wasn't aware 
of its role here. I am indeed using it; in fact, according to the docs it is 
a required option for ELB health checks. I had it set to 180, and I just 
tried it with lower values of 10 and 1 second. In both cases the behavior 
is the same: the autoscale group considers the instances healthy (because 
of the grace period, even at the lower values), and as a result ansible moves 
on before the instances are InService in the ELB. Even with 
HealthCheckGracePeriod at the lowest possible value of 1 second, a race 
exists between the module's health check and the ELB grace period.

I've worked around this for now with a script that does the following:
- Find the instances in the ASG
- Check the ELB to determine if they are healthy or not
- Exit 1 if not, 0 if yes

Then I use an ansible task with an "until" loop to check the return code. 
The script is here:

https://gist.github.com/anonymous/05e99828848ee565ed33
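
For reference, here is a minimal stdlib-only sketch of the kind of check the 
script performs; the real implementation is in the gist above, and the ELB 
name and the use of the AWS CLI (rather than boto) here are my own 
assumptions:

```python
# Hypothetical sketch of the workaround: ask the ELB (not the ASG) for
# instance health, and report failure until everything is InService.
import json
import subprocess

ELB_NAME = "my-elb"  # assumption: substitute your load balancer name


def all_in_service(instance_states):
    # True only when the ELB reports at least one instance and every one
    # of them is InService. The ASG's grace period masks exactly this state.
    return bool(instance_states) and all(
        s.get("State") == "InService" for s in instance_states
    )


def fetch_states(elb_name):
    # `aws elb describe-instance-health` returns {"InstanceStates": [...]}.
    out = subprocess.check_output(
        ["aws", "elb", "describe-instance-health",
         "--load-balancer-name", elb_name]
    )
    return json.loads(out)["InstanceStates"]


# To use as an exit-code check (on a host with the AWS CLI configured):
#   sys.exit(0 if all_in_service(fetch_states(ELB_NAME)) else 1)
```

The ansible "until" task then just retries the script until it exits 0.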

Happy to work this into an ansible module if you think this is useful. Or 
did I misunderstand the point about the health check grace period?

Thanks,
Ben


On Monday, November 24, 2014 7:25:58 AM UTC-8, James Martin wrote:
>
> Ben,
>
> Thanks for the question.    Considering this: 
> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html,
>  
>   "Auto Scaling marks an instance unhealthy if the calls to the Amazon EC2 
> action DescribeInstanceStatus return any state other than running, the 
> system status shows impaired, or the calls to Elastic Load Balancing action 
> DescribeInstanceHealth returns OutOfService in the instance state field."
>
> For determining the instance health status, we are fetching an ASG object 
> in boto and checking the health_status attribute for each instance in the 
> ASG, which is equal to either "healthy" or "unhealthy".  Are you using an 
> instance grace period option for the ELB? 
> http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html,
>  
> see HealthCheckGracePeriod.  This option is configurable with the 
> health_check_period setting found in the ec2_asg module.  By default it is 
> 500 seconds, which would prematurely report a healthy status: any instance 
> is considered healthy for the duration of the grace period.
>
> - James
>
>
>
>
>
> On Saturday, November 22, 2014 5:39:28 PM UTC-5, Ben Whaley wrote:
>>
>> Hi all,
>>
>> Sorry for resurrecting an old thread, but wanted to mention my experience 
>> thus far using ec2_asg & ec2_lc for code deploys.
>>
>> I'm more or less following the methods described in this helpful repo
>>
>> https://github.com/ansible/immutablish-deploys
>>
>> I believe the dual_asg role is accepted as the more reliable method for 
>> deployments. If a deployment uses two ASGs, it's possible to just delete 
>> the new ASG and everything goes back to normal. This is the "Netflix" 
>> manner of releasing updates.
>>
>> The thing I'm finding though is that instances become "viable" well 
>> before they're actually InService in the ELB. From the ec2_asg code and by 
>> running ansible in verbose mode it's clear that ansible considers an 
>> instance viable once AWS indicates that instances are Healthy and 
>> InService. Checking via the AWS CLI tool, I can see that the ASG shows 
>> instances as Healthy and InService, but the ELB shows OutOfService. 
>>
>> The AWS docs are clear about the behavior of autoscale instances with 
>> health check type ELB: "For each call, if the Elastic Load Balancing action 
>> returns any state other than InService, the instance is marked as 
>> unhealthy." But this is not actually the case. 
>>
>> Has anyone else encountered this? Any suggested workarounds or fixes?
>>
>> Thanks,
>> Ben
>>
>>
>> On Thursday, September 11, 2014 12:54:25 PM UTC-7, Scott Anderson wrote:
>>>
>>> On Sep 11, 2014, at 3:26 PM, James Martin <[email protected]> wrote:
>>>
>>> I think we’re probably going to move to a system that uses a tier of 
>>>> proxies and two ELBs. That way we can update the idle ELB, change out the 
>>>> AMIs, and bring the updated ELB up behind an alternate domain for the 
>>>> blue-green testing. Then when everything checks out, switch the proxies to 
>>>> the updated ELB and take down the remaining, now idle ELB.
>>>>
>>>>
>>> Not following this exactly -- what's your tier of proxies?  You have a 
>>> group of proxies (haproxy, nginx) behind a load balancer that point to your 
>>> application?
>>>
>>>
>>> Yes, nginx or some other HA-ish thing. If it’s nginx then you can 
>>> maintain a brochure site even if something horrible happens to the 
>>> application.
>>>
>>>  
>>>
>>>> Amazon would suggest using Route53 to point to the new ELB, but there’s 
>>>> too great a chance of faulty DNS caching breaking a switch to a new ELB. 
>>>> Plus there’s a 60s TTL to start with regardless, even in the absence of 
>>>> caching.
>>>>
>>>
>>> Quite right.  There are some interesting things you can do with tools 
>>> you could run on the hosts that would redirect traffic from blue hosts to 
>>> the green LB, socat being one.  After you notice no more traffic coming to 
>>> blue, you can terminate it.
>>>
>>>
>>> That’s an interesting idea, but it fails if people are behind a caching 
>>> DNS and they visit after you’ve terminated the blue traffic but before 
>>> their caching DNS lets go of the record.
>>>
>>> You're right, I did miss that.  By checking the AMI, you're only 
>>> updating the instance if the AMI changes.  If you are checking the launch 
>>> config, you are updating the instances if any component of the launch 
>>> config has changed -- AMI, instance type, address type, etc.
>>>
>>>
>>> That’s true, but if I’m changing instance types I’ll generally just 
>>> cycle_all. Because of the connection draining and the parallelism of 
>>> instance creation, it’s just as quick to do all of them instead of only 
>>> the ones that need changing. That said, it’s an obvious optimization for sure.
>>>
>>>
>>> Using the ASG to do the provisioning might be preferable if it’s 
>>>> reliable. At first I went that route, but I was having problems with the 
>>>> ASG’s provisioning being non-deterministic. Manually creating the 
>>>> instances 
>>>> seems to ensure that things happen in a particular order and with 
>>>> predictable speed. As mentioned, the manual method definitely works every 
>>>> time, although I need to add some more timeout and error checking (like 
>>>> what happens if I ask for 3 new instances and only get 2).
>>>>
>>>>
>>> I didn't have any issues with the ASG doing the provisioning, but I 
>>> would say nothing is predictable with AWS :).  
>>>
>>>
>>> Very true. Over the past few months I’ve had several working processes 
>>> just fail with no warning. The most recent is AWS sometimes refusing to 
>>> return the current list of AMIs. Prior to that it was the Available status 
>>> on an AMI not really meaning available. Now I check the list of returned 
>>> AMIs in a loop until the one I’m looking for shows up, Available status 
>>> notwithstanding. Very frustrating. Things could be worse, however: the API 
>>> could be run by Facebook...
>>>
>>>
>>>> I have a separate task that cleans up the old AMIs and LCs, 
>>>> incidentally. I keep the most recent around as a backup for quick 
>>>> rollbacks.
>>>>
>>>
>>> That's cool, care to share?
>>>  
>>>
>>>
>>> I think I’ve posted it before, but here’s the important bit. After 
>>> deleting everything but the oldest backup AMI (determined by naming 
>>> convention or tags), delete any LC that doesn’t have an associated AMI:
>>>
>>> def delete_launch_configs(asg_connection, ec2_connection, module):
>>>     changed = False
>>>
>>>     # Every launch configuration in the region.
>>>     launch_configs = asg_connection.get_all_launch_configurations()
>>>
>>>     for config in launch_configs:
>>>         image_id = config.image_id
>>>         images = ec2_connection.get_all_images(image_ids=[image_id])
>>>
>>>         # The AMI is already gone, so the LC is orphaned; remove it too.
>>>         if not images:
>>>             config.delete()
>>>             changed = True
>>>
>>>     module.exit_json(changed=changed)
>>>
>>>
>>> -scott
>>>
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Ansible Project" group.