Yesterday there turned out to be two unresponsive servers at work. I wasn't on call, so I didn't immediately know about the first one. But Nagios had complained about the second one.
So, I connected to work and tried to ssh into our loghost. No go, not enough disk space left to do that. No worries, it's a zone and its zfs dataset has a quota. I'll just remove that temporarily so that I can get on and clean up. Which is why we keep some of the zpool back from all the zones on the system. Otherwise I would've had to steal space from somewhere. It's also why some of us still prefer /var to be separate from root... though that's not the way zones are being done.

I then found out that loghost ran out of space because another server was spewing that /tmp was full, and that there was no more swap. Well, that happens on Solaris (/tmp is tmpfs backed by swap, so the two go together). I tried to ssh to it; of course that failed. So, I tracked down how to get to console, logged in, cleaned up /tmp and killed off the backlog of cron jobs that were filling up /tmp.

Later the on call person called me. He was on his way in to reboot the server when he saw that I had fixed the problem, and wanted to know how I had gotten on. I suppose the simple answer was that he should've tried console when ssh didn't work, and I could have left it at that. But he asked the all important question that so rarely comes up: "Why does that work?"

So, the longer explanation is that sshd handles incoming connections by wanting to fork itself first, which is hard when there's practically no free memory on the system. getty, OTOH, handles the login prompt and then exec's login (exec replaces the process memory that it's using with the new program), which authenticates and in turn exec's the login shell. No new process is needed along the way. And, luckily, there was still enough memory available for that to work.

Also, it's necessary to use root instead of our individual admin accounts, because root's shell is /bin/sh, which is only about 20% bigger than getty, while shells like bash/tcsh/zsh are more than 5 times bigger than /bin/sh. Most of us use bash for our admin accounts, one person uses tcsh (though he never complained that his account had stopped working because tcsh wasn't being made available anymore... he'd just use root directly), and another has been playing around with zsh.
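
The fork-versus-exec difference can be sketched in a few lines. This is only an illustration in Python, not what sshd or getty actually do internally. The function name pid_before_and_after_exec is made up for the example, and it assumes a POSIX system with /bin/sh. Somewhat ironically, the demo has to fork once itself so that the exec doesn't replace the script that is running it.

```python
import os


def pid_before_and_after_exec(shell="/bin/sh"):
    """Fork a helper child, then exec a shell inside that child.

    Returns (pid_before_exec, pid_reported_by_shell). The two match
    because exec replaces the program in the existing process instead
    of creating a new one.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()  # sshd-style step: the system must have room for a second process
    if pid == 0:
        # Child: report our PID, then become the shell (getty/login-style step).
        os.close(read_fd)
        os.write(write_fd, f"{os.getpid()}\n".encode())
        os.dup2(write_fd, 1)  # the shell's stdout goes to the pipe
        try:
            os.execv(shell, [shell, "-c", "echo $$"])  # $$ is the shell's own PID
        finally:
            os._exit(127)  # only reached if the exec itself failed
    # Parent: collect both PIDs from the pipe and reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        fields = pipe.read().split()
    os.waitpid(pid, 0)
    return int(fields[0]), int(fields[1])


if __name__ == "__main__":
    before, after = pid_before_and_after_exec()
    print("PID before exec:", before)
    print("PID seen by the shell after exec:", after)
```

The two PIDs come back equal, i.e. the shell ran inside the process that called exec. On a box with no memory or swap left, the os.fork() line is the one that would be expected to fail (raising OSError), which is roughly the position sshd was in.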
Though I'm not sure how the console trick translates to a system with ttymon and more than one tty port...

Once on, it was then a challenge figuring out where the problem files were and how to deal with them... by first going after older files, since removing an open file won't fix the problem (the space isn't freed until the process holding it open closes it) :) I recall that there are other tricks in this area, but ls worked intermittently enough to get things working again.

Good thing... I'd hate to end the uptime streak that this server has. It had been up 2525 days. (That means it was somewhere that didn't lose power the times we had the transfer switch problem and a significant number of UPSs had run down before something was done... in one case we stayed on generator power for 7 days.)

2525 days would mean it's been up since the PDU problems that had been going on when I first started (it has been 2540 days since I started). The PDU problem was that it wasn't correctly configured for use with a generator. We could go on to generator, but it would shut off when switching back to utility power. Eventually somebody read the documentation and saw that a switch needs to be changed for use with a generator...

Wonder if we have any other servers with that kind of uptime? I know there has been a request that we provide continuous reporting of system uptimes. Though I suspect it's to make sure that we don't have high uptimes on any of our systems...

--
Who: Lawrence K. Chen, P.Eng.
- W0LKC - Senior Unix Systems Administrator
For: Enterprise Server Technologies (EST) -- & SafeZone Ally
Snail: Computing and Telecommunications Services (CTS)
Kansas State University, 109 East Stadium, Manhattan, KS 66506-3102
Phone: (785) 532-4916 - Fax: (785) 532-3515 - Email: [email protected]
Web: http://www-personal.ksu.edu/~lkchen - Where: 11 Hale Library
