On Wed, 15 Mar 2006 19:48:01 EST, Steven said:
> later, 6 hours later, 5 days later, etc. Additionally, if some server that
> gives a yea/nay is on a coffee + donut break -- what would that have to do
> with kicking you offline after already being authenticated?

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." -- Leslie Lamport

When designing very large and complex systems, it gets harder and harder to avoid designing into them all sorts of odd dependencies and cascading failure modes.

For instance, the last 3 times our modem pool terminal servers got hosed up, it was due to the TACACS server not being able to contact our LDAP server. And the connection timeouts to the LDAP server were caused by some misbehaving software beating up on *another* one of our servers and making that server create a flood of non-optimized LDAP queries (most LDAP server software goes into severe oink mode when it has to do queries it doesn't have a pre-built index for).

Got that? The terminal server got indigestion waiting for the TACACS server, which was trying to talk to the LDAP server, but couldn't get a word in edgewise because some OTHER server was spewing broken queries at the LDAP server. Counting the user's machine, and the machine originating the broken query to the other server, there were a total of *7* logical machines in the chain (actually more, as several of these were really multiple machines behind a load balancer).

This sort of thing is just *loads* of fun to unsnarl at 10 PM, when none of the system architects are handy, but plenty of people are still trying to use the modem pool... ;)

I'd not be at *all* surprised if an unexpected failure of AOL's auth server caused the main AIM servers to hiccup when they got a new inbound request and were unable to deal with the failure mode, and to drop all the already existing sessions during the hiccup reset.

And to tie it all back into the INCIDENTS charter - one of the most common causes of a corporate server getting exploited is when a hacker finds some code that glues 2 server systems together, and the code isn't perfect.

Of course, this also means that it's usually very difficult to figure out what happened, which is why you often see "The hackers had been in the system for at least 3 weeks before they were detected".....
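
For what it's worth, here is a rough Python sketch of the kind of glue code that would *not* kick existing users offline when the yea/nay server goes on its break. Everything in it (SessionServer, AuthUnavailable, the two toy backends) is made up for illustration; I have no idea what AOL's or anybody else's code actually looks like. The only point is that a backend timeout fails the one new request and leaves the session table alone.

```python
class AuthUnavailable(Exception):
    """Raised when the (hypothetical) auth backend cannot answer in time."""


class SessionServer:
    """Toy front-end server that depends on an auth backend for new logins."""

    def __init__(self, authenticate):
        # authenticate: callable(user, password) -> bool,
        # which may raise AuthUnavailable on a timeout.
        self.authenticate = authenticate
        self.sessions = set()

    def login(self, user, password):
        try:
            ok = self.authenticate(user, password)
        except AuthUnavailable:
            # Fail only this request; already-authenticated sessions stay up.
            return "try-again-later"
        if ok:
            self.sessions.add(user)
            return "ok"
        return "denied"


def healthy_backend(user, password):
    # Stand-in for a working TACACS/LDAP chain (assumption, not a real API).
    return password == "secret"


def broken_backend(user, password):
    # Stand-in for the chain timing out somewhere upstream.
    raise AuthUnavailable("LDAP lookup timed out")


def demo():
    server = SessionServer(healthy_backend)
    first = server.login("alice", "secret")
    server.authenticate = broken_backend  # backend goes on its donut break
    second = server.login("bob", "secret")
    return first, second, sorted(server.sessions)
```

Calling demo() logs alice in while the backend is healthy, then tries bob after the backend breaks; bob is told to come back later and alice's session is still in the table. The failure mode I was speculating about is the opposite: the except branch is missing or does a full reset, and everybody gets dropped.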
