Thanks for the detailed writeup-- I'd like to see more posts like this. 

One question I had was that you were talking about full disks, but then started 
talking about being unable to call fork() due to RAM constraints. That threw me 
for a bit of a loop-- how did a full disk completely exhaust your RAM? 

--Corey

On Jun 23, 2013, at 11:11 AM, "Lawrence K. Chen, P.Eng." <[email protected]> wrote:

> Yesterday there turned out to be two unresponsive servers at work.  I wasn't 
> on call, so I didn't immediately know about the first one.  But, nagios had 
> complained about the second one.
> 
> So, I had connected to work and tried to ssh into our loghost.  No go, not 
> enough disk space left to do that.  No worries, it's a zone and its ZFS 
> dataset has a quota.  I'll just remove that temporarily so that I can get on 
> and clean up.  Which is why we keep some of the zpool back from all the zones 
> on the system.  Otherwise I would've had to steal space; it's also why some 
> of us still prefer /var to be separate from root....though it's not the way 
> zones are being done.
> 
> I then found out that loghost ran out of space because another server was 
> spewing messages that /tmp was full and that there was no more swap.  Well, 
> that happens on Solaris.  I tried to ssh to it; of course that failed.  So, I 
> tracked down how to get to console, logged in, cleaned up /tmp and killed off 
> the backlog of cron jobs that were filling up /tmp.
> 
> Later the on-call person called me; he was on his way in to reboot the 
> server when he saw that I had fixed the problem.  He wanted to know how I had 
> gotten on.
> 
> I suppose the simple answer was that he should've tried console when ssh 
> didn't work, and I could have left it at that.  But he asked the 
> all-important question that so rarely comes up: "Why does that work?"
> 
> So, the longer explanation is that sshd handles incoming connections by 
> wanting to fork itself first, which is hard when there's practically no free 
> memory on the system.
> 
> getty, OTOH, hands off to login, which handles the user authentication and 
> then exec's the login shell (exec replaces the memory of the process that's 
> already running, rather than creating a new one).  And, luckily there was 
> still enough available for that to work.  Also, it's necessary to use root 
> instead of our individual admin accounts, because root's shell, /bin/sh, is 
> only about 20% bigger than getty, while shells like bash/tcsh/zsh are more 
> than 5 times bigger than /bin/sh.  Most of us use bash for our admin 
> accounts, one person uses tcsh (though he never complained that his account 
> had stopped working because tcsh wasn't being made available anymore...he'd 
> just use root directly), and another has been playing around with zsh.
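
For anyone who wants to poke at the fork-versus-exec difference described
above, here is a minimal Python sketch. The function names are made up for
illustration, and this is not how sshd or getty are actually written; it
only shows that one path has to create a second process (the step that can
fail when memory and swap are nearly gone) while the other reuses the
process that already exists.

```python
import os


def fork_then_exec(argv):
    """sshd-style: fork a child first, then exec argv in the child.

    The fork() is the step that can fail when memory and swap are nearly
    exhausted, because a whole new process has to be created.
    Returns the child's exit status.
    """
    pid = os.fork()
    if pid == 0:
        # Child: replace this copy of the process with the new program.
        try:
            os.execv(argv[0], argv)
        finally:
            os._exit(127)  # only reached if execv itself failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


def exec_in_place(argv):
    """getty/login-style: no fork; the running process becomes argv.

    On success this never returns: the PID stays the same and the old
    program's memory is replaced by the new program's.
    """
    os.execv(argv[0], argv)
```

Calling fork_then_exec(["/bin/sh", "-c", "exit 3"]) hands back the child's
exit status, while exec_in_place(["/bin/sh"]) turns the calling process into
the shell, so nothing after that call ever runs.
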
> 
> Though I'm not sure how that translates to a system with ttymon and more 
> than one tty port....
> 
> Once on, it was then a challenge figuring out how to identify where the 
> problem files were and deal with them...by first going after older files, 
> since removing an open file won't fix the problem :)
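
The point about open files can be checked with a small Python sketch. The
helper name and the default size are hypothetical, and it assumes an
ordinary local POSIX filesystem (NFS, for one, handles this differently).
It removes a file's name while a descriptor is still open and then asks the
kernel what is left.

```python
import os
import tempfile


def unlink_while_open(size=4096):
    """Delete a file that is still open and report what the kernel sees.

    Returns (link_count, bytes_still_held) as seen through the open
    descriptor after the name has been removed.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * size)
        os.unlink(path)        # the name is gone, as after rm
        st = os.fstat(fd)      # but the open descriptor still works
        return st.st_nlink, st.st_size
    finally:
        os.close(fd)           # only now can the space be reclaimed
```

The link count drops to zero, yet the data is still there until the last
descriptor is closed, which is why killing the runaway jobs matters as much
as deleting their files.
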
> 
> I recall that there are other tricks in this area, but ls worked 
> intermittently enough to get things working again.
> 
> Good thing...I'd hate to end the uptime streak that this server has....it had 
> been up 2525 days.  That means it was somewhere that didn't lose power the 
> times we had a transfer switch problem and a significant number of UPSs had 
> run down before something was done....in one case we stayed on generator 
> power for 7 days.... 2525 days would mean it's been up since the PDU problems 
> that had been going on when I first started (it has been 2540 days since I 
> started).  The PDU problem was that it wasn't correctly configured for use 
> with a generator.  We could go on to generator, but it would shut off when 
> switching back to utility power.  Eventually somebody read the documentation 
> and saw that a switch needs to be changed for use with a generator.....
> 
> Wonder if we have any other servers with that kind of uptime?  I know there 
> has been a request that we provide continuous reporting of system uptimes.  
> Though I suspect it's to make sure that we don't have high uptimes on any of 
> our systems.....
> 
> -- 
> Who: Lawrence K. Chen, P.Eng. - W0LKC - Senior Unix Systems Administrator
> For: Enterprise Server Technologies (EST) -- & SafeZone Ally
> Snail: Computing and Telecommunications Services (CTS)
> Kansas State University, 109 East Stadium, Manhattan, KS 66506-3102
> Phone: (785) 532-4916 - Fax: (785) 532-3515 - Email: [email protected]
> Web: http://www-personal.ksu.edu/~lkchen - Where: 11 Hale Library
> _______________________________________________
> Discuss mailing list
> [email protected]
> https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
> This list provided by the League of Professional System Administrators
> http://lopsa.org/
