> From: [email protected] [mailto:discuss-
> [email protected]] On Behalf Of Lawrence K. Chen, P.Eng.
> 
> Good thing...I'd hate to end the uptime streak that this server has....it had
> been up 2525 days. 

I've said many times before that you shouldn't be proud of your uptime, 
because it means you're not applying updates, so you're exposing yourself to 
bugs and vulnerabilities.  I understand that systems sometimes run in a 
protected environment where that's not much of a concern.  But there are 2-3 
other reasons, which you've demonstrated:

"Why does it work" to log in at the console instead of over ssh when a system 
is out of disk or memory?  Well, often it doesn't.  You were lucky it worked 
for you this time.  The same thing that prevents sshd from spawning another 
process sometimes prevents bash, ls, kill, or rm from spawning or otherwise 
working properly.  But if you're patient enough, retry enough times, and/or 
get lucky, sometimes it works.

When a system is out of memory or out of disk, the processes it runs tend to 
become unstable, because the "out of memory" and I/O errors that hit them are 
somewhat inconsistent.  As tiny chunks of memory and disk get freed by other 
processes or reclaimed by the kernel, a running process will sometimes succeed 
and sometimes not, until a few minutes elapse and the system is likely 
completely wedged.  After you alleviate the cause of the problem, you really 
*should* reboot in order to assure system stability.
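
For what it's worth, here is a minimal Python sketch of checking that the disk 
side of the problem really has been alleviated before you schedule that reboot.  
The function names and the 5% threshold are my own assumptions for 
illustration, not anything from the original thread, and it says nothing about 
memory.

```python
import shutil


def free_fraction(total, free):
    """Return the fraction of space that is free (0.0 if total is zero)."""
    return free / total if total else 0.0


def disk_is_low(path="/", min_free=0.05):
    """True if the filesystem holding `path` has less than `min_free` free.

    The 5% default is an arbitrary, assumed threshold; pick your own.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_fraction(usage.total, usage.free) < min_free
```

Calling disk_is_low("/var") after cleaning up tells you whether you actually 
bought back some headroom, or only enough for the next few log lines.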

Which brings up the second point.  If you never reboot, then you don't know 
whether your system is able to survive a reboot.  I can't describe how many 
times I've inherited systems from former admins (usually fired) who never 
rebooted, or how much frustration I've had to endure as a result.  It's 
painfully common that there is no documentation of which daemons need to be 
running, what system dependencies exist (first boot machine #1 and ensure the 
"foobar" service is available before booting machine #2, etc.), or how the 
services were started in the first place.  If I'm lucky enough to look at the 
system while it's still operable, I run 'ps' and the like to make an inventory 
of what services are running.  I log in as root, run "history", and start 
reading for clues.  I see things like "nohup java blah blah" and "screen this 
and that" and ...  You get the point.  Makes me want to fire the person again, 
in front of a firing squad.  Undocumented, manually launched services running 
in production.  Brilliant.  Hopefully you at least have none of those.
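
A rough sketch of that inventory step in Python, in case it helps.  It parses 
the output of "ps -eo pid,ppid,user,args" and flags processes whose executable 
name matches a short list of hints.  The function names and the hint list 
("java", "screen") are assumptions of mine, taken from the examples above, and 
a match is only a candidate for a hand-started service, not proof of one.  
Note that "nohup" itself won't show up in ps, since it execs the command it 
was given.

```python
import os
import subprocess


def parse_ps(text):
    """Parse `ps -eo pid,ppid,user,args` output into a list of dicts."""
    rows = []
    for line in text.splitlines()[1:]:  # first line is the column header
        parts = line.split(None, 3)     # args may contain spaces; keep whole
        if len(parts) < 4:
            continue
        pid, ppid, user, args = parts
        rows.append({"pid": int(pid), "ppid": int(ppid),
                     "user": user, "args": args.rstrip()})
    return rows


def manually_launched_candidates(rows, hints=("java", "screen")):
    """Return rows whose executable basename (case-insensitive) is in hints."""
    found = []
    for row in rows:
        exe = os.path.basename(row["args"].split()[0]).lower()
        if exe in hints:
            found.append(row)
    return found


def inventory():
    """Run ps on the live system and return the parsed process list."""
    out = subprocess.run(["ps", "-eo", "pid,ppid,user,args"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ps(out)
```

Save the output of inventory() somewhere off the box while the system is 
still up; it is the only record you have until someone writes real startup 
scripts.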
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
