On Sat, Aug 06, 2011 at 02:42:45AM +0300, Piavlo wrote:
>  Well certainly aws has its limitations which force you to design a 
> very different infrastructure than you would in normal datacenter 
> environment.
> IMHO this is the great thing about those limitations as you are forced 
> to start thinking differently and end up using a set of well known and 
> established tools to
> overcome those limitations.

Surely, getting rid of everything which has worked fine for ages and
limiting oneself to lazy and naive approaches like DNS because
"it's probably good enough to offer high availability" is a way to think
differently. But it's not the way I conceive reliable infrastructures.

> I'm talking mainly about 
> monitoring/automation/deployment tools & centralized coordination 
> service tools - so that you can automatically react to any change in the 
> infrastructure.

Changes should not happen often, so you can expect that they come with
a minor cost. Or you have something which makes your servers die several
times a minute and you need to fix that before considering adding servers.

> With those tools you don't really care if some server ip changes - the 
> ip only changes if you stop and then start an ec2 instance.
> If you reboot an ec2 instance the ip does not change. But normally you 
> would not really stop/start an instance - this really happens when 
> something bad happens to the instance, so that you need to reboot it, 
> but a reboot does not always work since there might be a hardware problem 
> on the server hosting this ec2 instance.
> So you need to stop it and then start it - when you start it, it will 
> start on a different hardware server.

Fine. In the real world, when a server is dead, one guy comes with a
master image, reinstalls the server on other hardware and restores its
configuration.
The IP is taken back and everything magically works again. In the VPC you
should be able to do that too when you decide to replace a faulty instance.

> But you don't really need to do all this stuff manually. If some ec2 
> instance is sick this is detected and propagated through the centralized 
> coordination service to the relevant parties.

Here I think you need to define "sick". For me, a "sick" server is one
that needs a stop/start or reboot sequence to be fine again. Otherwise
it's considered dead and needs at least repair, at worst replacement.
Repair is covered by high availability. In case of replacement, you can
keep the IP, so seen from the LB it's just a repair.
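That distinction can be sketched as a trivial probe loop. Everything here
(the probe command, the retry count, the labels) is my own illustration,
not something from the thread:

```shell
#!/bin/sh
# Hedged sketch of "sick" vs "dead": a backend that answers again after
# a few retries is merely sick (a restart may fix it); a persistent
# failure is dead (repair at least, replacement at worst).
classify() {
    probe=$1            # a command that exits 0 when the service answers
    $probe && { echo up; return; }
    for i in 1 2 3; do  # a real probe would pause between attempts
        $probe && { echo sick; return; }
    done
    echo dead
}

classify true    # a healthy backend
classify false   # a persistently failing one
```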

> Then you can decide to 
> start a service from a failed instance on another already running ec2 
> instance or start new instance configure itself and start the service. 

If the already running instance was OK, why was it not integrated into the
LB farm then?

> The old failed instance can be just killed or suspended. (So VPC or a 
> normal datacenter will not help here - since the service will be running 
> on a different instance/server with a different ip - yes you could use a 
> floating ip in a normal datacenter but you would not want to do that for 
> every backend, especially when backends are automatically added/removed. 

No, but here you're already describing corner cases - I see a lot of "if"s
to reach that case, and at this point I think that a simple reload
is the smallest operation to complete the process!

> You would normally use a floating ip for the frontend). Then the service 
> is active again on another/new instance - this is again propagated 
> through the centralized coordination service. Then you automatically 
> update the needed stuff on the relevant instances - like, in this 
> specific case, updating /etc/hosts and restarting/reloading haproxy. 
> (All I wanted was to avoid the haproxy restart/reload - there is no 
> technical problem at all in doing the restart). And of course all this 
> is done automatically without human intervention.

So you realize that you're saying you lose a server, you look for another
compatible server, you find one which is doing nothing useful, you decide
to install the service on it, you start it, you update all /etc/hosts and
the only thing you don't want to do is to reload a process, which represents
less than 0.01% of all the operations that have been performed automatically
for you! I don't buy that; it does not make any sense to me, I'm sorry.
For me, it's comparable to the guy who would absolutely want to be able to
power his servers on batteries before moving the rack to another city, so
that he can avoid a shutdown+restart sequence which would kill his
impressive uptimes.
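For what it's worth, the reload in question is itself graceful: haproxy's
-sf option starts a new process which takes over the listeners and tells
the old PIDs to finish serving their existing sessions before exiting.
The paths below are the common defaults, adjust as needed:

```shell
# Start a new haproxy with the same configuration; -sf ("soft finish")
# makes the old process(es) stop accepting new connections and exit
# once their current sessions are done, so nothing is cut mid-request.
haproxy -f /etc/haproxy/haproxy.cfg -sf $(cat /var/run/haproxy.pid)
```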

> From where I stand I see no unreliability problem with aws - a normal 
> datacenter is just as unreliable for me as aws.
> I don't need the normal datacenter or the VPC. The usage of those tools 
> and the other aws features make aws much more attractive and reliable 
> than normal datacenter.

Quite frankly, given the way you consider reliability, I fail to understand
why you insist on using a load balancer. Why not advertise all your servers
with the DNS and let the monitoring mechanism automatically remove them from
the pool since it's reliable enough for you? It would remove several steps
which appear annoying in your case, one of them being to update the load
balancer to consider the new server's address.
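That all-DNS variant can be sketched crudely: monitoring keeps a flat file
of advertised backend addresses and prunes failed ones, from which the
round-robin records would be regenerated. The file name and addresses are
invented for the example; this is a sketch, not any particular tool:

```shell
#!/bin/sh
# The pool file stands in for the set of A records advertised in DNS.
pool=pool.txt
printf '10.0.0.11\n10.0.0.12\n10.0.0.13\n' > "$pool"

# Monitoring calls this when a backend fails its checks; a real setup
# would then regenerate the zone (or push a dynamic update) from the file.
remove_backend() {
    grep -v "^$1\$" "$pool" > "$pool.tmp" && mv "$pool.tmp" "$pool"
}

remove_backend 10.0.0.12
cat "$pool"
```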

> The only really annoying thing about ec2 is that you can have only one 
> ip per instance - this makes the HA stuff more difficult to implement 
> and you have to design it differently than in a normal datacenter. AFAIU 
> the aws VPC would not help there either - since VPC instances can still 
> have only one ip and/or you can't reassign it to another ec2 instance.

From my experience with a huge failure on a large site there, it was not
the only really annoying thing. But I think things might have improved a
little bit. Still, reliable architectures are built from the ground up,
not by trying to hide massive holes and issues with plastic tape. There
are people who have no problem building reliable enough architectures
there but they adapt their process to the hosting provider's limitations.
Many sites are happy with Rightscale for instance, which does all the
dirty work for you. That way you don't have to reload your process, it
will be done automatically if needed.

Regards,
Willy

