On Fri, Sep 26, 2014 at 12:57:18PM +0200, Klavs Klavsen wrote:
> Being able to handle if some server fubars itself (in my case one server
> suddenly had a stale NFS handle) - or if I f.ex. do automatic upgrades..
> it's very nice to just know that while the webserver (sometimes a
> tomcat - which takes ~2 minutes to get ready) gets ready again, my
> loadbalancer will simply handle failed requests by serving them to the
> others - until the health checker comes by (if it's really fubar) and
> removes it.
It's just putting a brown paper bag on a very stinky piece of shit. Sure
you don't see the shit anymore, but its smell indicates that it's under
the paper (or the flies show you the way).
The examples you provide just show that the minimum requirements for
high availability are simply not met:
- seems like no redundancy (NFS)
- no planning for operations (automatic upgrades must NEVER cause
any trouble, because they must be correctly advertised to the
load balancer first). No amount of sauce in front of this will
ever fix whatever process you're suddenly interrupting there.
- considering that you'd hide the problem for 2 minutes... Can you
imagine how long 2 minutes are in terms of internet traffic? Your
servers MUST be checked, and sending retries to another server
for two minutes will not correctly hide the effects of the
restarting server. Users will still feel all the trouble,
including the random outputs, the slow responses, etc...
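For what it's worth, proper active checks need no retries at all. A
minimal HAProxy sketch (server names, addresses and timings here are
assumptions, not taken from your setup):

```
backend web
    # probe an application-level URL, not just the TCP port
    option httpchk GET /health
    # mark a server down after 3 failed probes, up again after 2 good ones
    default-server inter 2s fall 3 rise 2
    server tomcat1 10.0.0.1:8080 check
    server tomcat2 10.0.0.2:8080 check
```

And for a planned upgrade you drain the server first (e.g. "set server
web/tomcat1 state drain" on the stats socket), so that no in-flight
request is ever interrupted in the first place.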
Such retries should never be done. When a request is sent to a server,
it has effects even if it is idempotent: it produces logs through a
number of layers, triggers error handling, etc... Replaying it unchanged
on another set of servers causes much more harm than it hides.
For example, when a front proxy sets a unique ID on each transaction,
you end up with two sets of logs carrying the same unique ID but with
different results. Great...
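The duplicated-ID effect is easy to demonstrate with a toy proxy.
Everything below (the function names, the "logs" list) is hypothetical;
only the mechanism matters:

```python
import uuid

def handle(request, servers, logs, max_retries=1):
    """Toy proxy: assigns one unique ID, then replays the request
    on another server when the first attempt fails."""
    req_id = uuid.uuid4().hex  # set once by the front proxy
    for server in servers[:max_retries + 1]:
        result = server(request)
        # every attempt leaves a log line carrying the SAME unique ID
        logs.append((req_id, server.__name__, result))
        if result != "error":
            return result
    return "error"

def failing_server(request):
    return "error"        # e.g. a stale NFS handle behind it

def healthy_server(request):
    return "200 OK"

logs = []
handle("GET /", [failing_server, healthy_server], logs)
# logs now holds two entries with one unique ID and two different results
```

Anyone correlating logs by that ID now sees one transaction that both
failed and succeeded.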
> > The ability of Varnish (and Nginx for that matter) to do this is an
> > anti-feature, IMHO.
> >
> You must live in another world than I do.
Well, maybe he simply ensures his infrastructure is properly built and
troubleshootable.
> In my world it's a must-have feature.
I agree with him that it's an anti-feature as well.
> It means that if one server has 10% of its requests that "screw up" -
> due to f.ex. a stale nfs handle.. the 90% will still be able to be
> served by it (ie. it's not pulled out entirely) - and I won't serve
> broken requests in the "health-check interval" period..
Wow! So you consider it good enough to remain in production with 10% of
errors? I would already have screamed at 1%, but 10%! No internet-facing
web site nowadays can display correctly with 10% of errors. The problem
must be fixed at its root, not hidden.
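And if the goal is really to stop sending traffic to a server that
produces errors, that belongs in the checks, not in retries. A sketch
using HAProxy's passive, traffic-based checks (names and thresholds are
assumptions):

```
backend web
    # watch the responses the server actually returns, and mark it
    # down after 10 consecutive layer-7 errors instead of replaying
    # the failed requests elsewhere
    server tomcat1 10.0.0.1:8080 check observe layer7 error-limit 10 on-error mark-down
```

This way the erroring server is taken out within a few requests, and
the regular checks decide when it may come back.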
> and frankly - I'd would be very sorry to let the health-check pull out
> the ENTIRE backend server - just because 10% of the site(s) on it are
> fubar. I'd rather just have it retry another in those 10% cases, until I
> fix the issue.
Your server is simply dead and unreliable. Why not serve truncated or
even corrupted objects if you're going that route? You're just relying
on your server's FS cache to serve objects that are still present in
memory after your unique NFS server died. The solution tends to be:
don't do this.
Regards,
Willy