Thanks for the detailed writeup. I'd like to see more posts like this. One question: you started out talking about full disks, but then moved on to being unable to call fork() due to RAM constraints. That threw me for a bit of a loop. How did a full disk completely exhaust your RAM?
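For anyone else following the thread, here is how I read the fork() versus exec() distinction described in the quoted message, as a minimal Python sketch. The function names `spawn_via_fork` and `replace_via_exec` are made up for illustration, and the comments about how an out-of-memory failure would surface are my assumption, not something taken from the original report.

```python
import os
import sys


def spawn_via_fork(argv):
    """sshd-style: duplicate the current process, then exec in the child.

    Returns the child's exit code, or None if fork() failed or the child
    was killed by a signal.
    """
    try:
        pid = os.fork()
    except OSError as err:
        # Assumption: with no free memory or swap, this is where the
        # failure would show up, before any new program is started.
        print(f"fork failed: {err}", file=sys.stderr)
        return None
    if pid == 0:
        # Child: replace the copied process image with the new program.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # exec failed, e.g. program not found
    # Parent: wait for the child and report how it exited.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None


def replace_via_exec(argv):
    """getty-style: no new process; the current one is overwritten.

    Does not return on success, because the calling program is gone.
    """
    os.execvp(argv[0], argv)
```

Calling `spawn_via_fork([sys.executable, "-c", "pass"])` needs room for a second process, even if only briefly, while `replace_via_exec(["/bin/sh"])` only needs room for the new program in place of the old one. If I have that right, it would explain why the console login still worked when ssh did not.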
--Corey

On Jun 23, 2013, at 11:11 AM, "Lawrence K. Chen, P.Eng." <[email protected]> wrote:

> Yesterday there turned out to be two unresponsive servers at work. I wasn't
> on call, so I didn't immediately know about the first one. But nagios had
> complained about the second one.
>
> So, I had connected to work and tried to ssh into our loghost. No go, not
> enough disk space left to do that. No worries, it's a zone and its zfs
> dataset has a quota. I'll just remove that temporarily so that I can get on
> and clean up. Which is why we keep some of the zpool back from all the zones
> on the system. Otherwise I would've had to steal space. It's also why some of
> us still prefer /var to be separate from root....though it's not the way
> zones are being done.
>
> I then found out that loghost ran out of space because another server was
> spewing messages that /tmp was full, and that there was no more swap. Well,
> that happens on Solaris. I tried to ssh to it; of course that failed. So, I
> tracked down how to get to console, logged in, cleaned up /tmp, and killed
> off the backlog of cron jobs that were filling up /tmp.
>
> Later the on-call person called me; he was on his way in to reboot the server
> when he saw that I had fixed the problem. He wanted to know how I had gotten on.
>
> I suppose the simple answer was that he should've tried console when ssh
> didn't work, and left it at that. But he asked the all-important question
> that so rarely comes up: "Why does that work?"
>
> So, the longer explanation is that sshd handles incoming connections by
> wanting to fork itself first, which is hard when there's practically no free
> memory on the system.
>
> getty, OTOH, handles the user authentication and then exec's (which replaces
> the process memory that it's using with) the login shell. And, luckily, there
> was still enough available for that to work. Also, it's necessary to use root
> instead of our individual admin accounts, because /bin/sh is only about 20%
> bigger than getty, while shells like bash/tcsh/zsh are more than 5 times
> bigger than /bin/sh. Most of us use bash for our admin accounts, one person
> uses tcsh (though he never complained that his account had stopped working,
> because tcsh wasn't being made available anymore...he'd just use root
> directly), and another has been playing around with zsh.
>
> Though not sure how it translates if it was a system with ttymon and more
> than one tty port....
>
> Once on, it was then a challenge figuring out how to identify where the
> problem files were and deal with them...by first going after older files,
> since removing an open file won't fix the problem :)
>
> I recall that there are other tricks in this area, but ls worked
> intermittently enough to get things working again.
>
> Good thing...I'd hate to end the uptime streak that this server has....it had
> been up 2525 days. (That means it was somewhere that didn't lose power the
> times we had transfer switch problems and a significant number of UPSs had
> run down before something was done....in one case we stayed on generator
> power for 7 days....) 2525 days means it's been up since the PDU problems
> that had been going on when I first started (it has been 2540 days since I
> started). The PDU problem was that it wasn't correctly configured for use
> with a generator. We could go on to generator, but it would shut off when
> switching back to utility power. Eventually somebody read the documentation
> and saw that a switch needs to be changed for use with a generator.....
>
> Wonder if we have any other servers with that kind of uptime? I know there
> has been a request that we provide continuous reporting of system uptimes.
> Though I suspect it's to make sure that we don't have high uptimes on any of
> our systems.....
>
> --
> Who: Lawrence K. Chen, P.Eng. - W0LKC - Senior Unix Systems Administrator
> For: Enterprise Server Technologies (EST) -- & SafeZone Ally
> Snail: Computing and Telecommunications Services (CTS)
> Kansas State University, 109 East Stadium, Manhattan, KS 66506-3102
> Phone: (785) 532-4916 - Fax: (785) 532-3515 - Email: [email protected]
> Web: http://www-personal.ksu.edu/~lkchen - Where: 11 Hale Library
> _______________________________________________
> Discuss mailing list
> [email protected]
> https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
> This list provided by the League of Professional System Administrators
> http://lopsa.org/
