On 3/7/12 2:35 PM, Darin Perusich wrote:

> I'm deploying NUT in a data center and I'm curious to know how others
> have gone about staging the shutdown of various systems. The systems
> are broken down into groups of importance, group1 being the most
> important group4 being the least. In the event of an outage take down
> group4 after 5 minutes, group3 after 10 minutes, group2 after 15, and
> finally group1 after 20.
> 
> I've only been using NUT for a few days but I'm assuming I can
> accomplish this with a combination of multiple systems and upssched events.

Here's how I handle it. In my case, the trigger is not a power failure (I don't
stage the shutdowns in that case) but an air conditioner failure that lets the
temperature rise in my server room. I want to shut down the less-important
systems first, and gradually shut down more systems if the temperature
continues to rise.

The "tagger" script is something custom to our site. For purposes of this
script, it defines what you refer to as group4, group3, etc.
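
For anyone without a comparable site script, a rough stand-in might look like
the sketch below. It is hypothetical and is not the actual get-hosts-by-tag.pl:
the function name hosts_by_tag, the HOST_TAGS_FILE variable, the default path
/etc/host-tags, and the flat "host tag tag ..." file format are all assumptions
made for illustration (the real script reads an XML host database).

```shell
# Hypothetical stand-in for get-hosts-by-tag.pl: print, on one line, every
# host that carries at least one of the tags given as arguments.
# HOST_TAGS_FILE and its "host tag tag ..." line format are assumptions
# made for this sketch only.
hosts_by_tag() {
    awk -v wanted="$*" '
        BEGIN {
            # Turn the requested tags into a lookup table.
            n = split(wanted, list, " ")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        /^#/ || NF < 2 { next }    # skip comments and hosts without tags
        {
            # Emit the host once if any of its tags was requested.
            for (i = 2; i <= NF; i++)
                if ($i in want) { out = out sep $1; sep = " "; break }
        }
        END { print out }
    ' "${HOST_TAGS_FILE:-/etc/host-tags}"
}
```

With a tag file containing lines such as "node1 stage1" and "login1 stage2
login", calling `hosts_by_tag stage1 stage2` would print both host names on a
single line, which is the form the Perl script passes on to run_all.sh.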

The "run_all.sh" script, which I can send you though it's pretty trivial, sends
a group of systems the same command. I normally use it to patch my cluster, but
here it's being used to send the "shutdown" command.
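
As a rough idea of what such a script can look like, here is a minimal sketch.
It is not the actual run_all.sh; it only mirrors the way the Perl script calls
it (an optional "-f", then the command, then a space-separated host list in one
argument). The function name run_all and the RUN_ALL_SSH variable are
assumptions added so the loop can be dry-run with echo instead of ssh.

```shell
# Hypothetical stand-in for run_all.sh: run one command on every listed host.
# Usage: run_all [-f] "command" "host1 host2 ..."
# RUN_ALL_SSH is an assumption made for this sketch; it names the program
# used to reach each host (default: ssh).
run_all() {
    background=0
    if [ "$1" = "-f" ]; then
        background=1
        shift
    fi
    cmd=$1
    hosts=$2
    for host in $hosts; do
        if [ "$background" -eq 1 ]; then
            # Do not wait for this host before contacting the next one.
            ${RUN_ALL_SSH:-ssh} "$host" "$cmd" &
        else
            ${RUN_ALL_SSH:-ssh} "$host" "$cmd"
        fi
    done
    # Design choice of this sketch: collect background jobs before returning.
    wait
}
```

Setting `RUN_ALL_SSH=echo` before calling `run_all -f "uptime" "node1 node2"`
prints what would be sent to each host without contacting anything, which is a
safe way to check the host list before a real shutdown command goes out.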

I hope this is useful.

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://[email protected]
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
#!/usr/bin/perl

# Check the ambient temperature in the computer room.
# If it's over a threshold, start shutting down computer systems
# in stages. If the temperature falls back below a recovery point,
# stop escalating the stages.
# 19-Jul-2011 WGS

my $debug;

my $threshold = 90; # in degrees Fahrenheit
my $recover = 85; # Fahrenheit.
my $monitor = "eaton-rack-monitor";
my $tagger = "/usr/nevis/adm/get-hosts-by-tag.pl";
my $stagefile = "/var/nevis/ambient-temperature-stage-file";

# The following command assumes that the net-snmp-utils 
# package has been installed on the computer. 
# Many Bothans died to discover the OID for the ambient temperature
# (translation: it takes some exploration with the snmpwalk command
# and some intelligence to figure it out).

my $result = `/usr/bin/snmpget -v1 -c public ${monitor} 1.3.6.1.4.1.534.6.7.1.1.3.2.1.3.1`;

# If this command is successful, the temperature will be in the digits at
# the end of the result.

if ( $result =~ /^.*\s(\d+)/ )
{
	# The digits are an integer, ten times the temperature in degrees Fahrenheit.
	my $temperature = $1 / 10;

	if ( $temperature < $recover )
	{
		# The A/C situation has recovered. Delete the stage file.
		unlink $stagefile;
	}

	if ( $temperature > $threshold )
	{
		# By default, we're at stage 1, unless a stage file exists.
		my $stage = 1;
		# Does the stage file already exist?
		if ( -r ${stagefile} )
		{
			# It does, so read the last stage from the file.
			open (INPUT,${stagefile}) || die "Cannot open ${stagefile}: $!\n";
			my $laststage = <INPUT>;
			close INPUT;
			chomp $laststage;
			
			# Escalate the stage from last time. 
			$stage = $laststage + 1;
		}
	
		# Write the stage for the next invocation of this script.
		open (OUTPUT, ">".${stagefile}) || die "Cannot open ${stagefile}: $!\n";
		print OUTPUT $stage, "\n";
		close OUTPUT;
		
		# Get the list of systems to be shut down at this stage; this comes
		# from /usr/nevis/adm/host-database.xml
		# In case someone turned on a system that should be turned off, issue
		# the shutdown command for all stages up to this one.
		my $tags;
		for (my $s=1; $s<=$stage; $s++)
		{
			$tags .= " stage" . $s;
		}
		my $systems = `${tagger} ${tags}`;
		
		# Create a temporary file.
		my $message = `/bin/mktemp -t`;
		chomp $message;
		open (MESSAGE,">".$message) || die "Can't open temporary work file $message: $!\n";
		
		# Write a message to the sysadmin.
		print MESSAGE <<EOF;
Emergency! \nTemperature situation at stage ${stage}. \n${monitor} reports a temperature of ${temperature}, which is over the threshold of ${threshold}. \nShutting down ${systems}
EOF
		close MESSAGE;
		my $subject = "Emergency temperature shutdown - stage ${stage}";
		`/bin/mail -s "${subject}" sysadmin\@nevis.columbia.edu < $message`;

		# "-h" means 'halt'
		# "+2" means give the users two minutes to log off, issuing ominous warnings all the while.
		# The "-k" option of /sbin/shutdown only sends the warning messages; it doesn't actually shut down.
		# Take it out when we move into production. 22-Jul-2011 - removed! 
		my $command = "wall Emergency shutdown due to computer room temperature of ${temperature}; shutdown -h +2 Computer room temperature alert";
		
		# Issue the command to the selected systems.
		# "-f" means don't wait for one computer's shutdown command to finish
		# before moving on to the next. 
		`/usr/nevis/adm/run_all.sh -f "$command" "${systems}"`;
		
		# It's probably silly, but once the login servers are shut down, the mail
		# and web servers have to be rebooted to keep running properly. (It's silly
		# because if we've escalated to this point, we're probably going to move on
		# to shut down mail and web services anyway.)
		if ( ${stage} == 3 )
		{
			`/usr/nevis/adm/run_all.sh -f "/sbin/shutdown -r now" "mail www"`;
		}
	}
}
else
{
	die "Error in reading temperature:\n$result\n"
}

_______________________________________________
Nut-upsuser mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/nut-upsuser
