Below:

From: [email protected] [mailto:[email protected]] On 
Behalf Of Gareth Miles
Sent: Friday, May 12, 2017 8:40 AM
To: [email protected]
Subject: [msmom] SCOM Network outage

Hi

Our company is having an emergency network outage next weekend for 6 hours, 
possibly longer.

I have a SCOM 2012 SP1 management group with 6 management servers in our office 
which will be affected, and 13 gateway servers around the world which connect 
to 3 of the management servers, with the agent count fairly evenly distributed 
amongst those three management servers.
The site with the largest agent count has around 750 agents, with two gateways 
and the agents split between them.
The other gateways have between 200 and 400 agents connecting to them.

During the network outage the gateways will not be able to connect to the 
management servers, and the management servers will lose their connection to 
the OperationsManager and data warehouse database servers.

I have three plans in mind, but I'm not sure which is best, or if there's a 
cleaner way of managing the outage.
Any advice would be appreciated.

Plan 1
Put all agents into maintenance mode at the Windows Computer level before the 
network outage, so only discoveries are processed.

KH - that is an incorrect assumption.  When you place a Windows Computer into 
MM, EVERYTHING unloads.  Discoveries are no different than rules or monitors in 
this regard.

When the network outage occurs, the gateways and the agents will queue the 
discovery data until network connectivity returns.

KH - no - there will be nothing to queue.  If the agents go into MM, they will 
unload the workflows and send nothing across the wire.

Plan 2
Put all agents into maintenance mode, then shut down the management servers and 
DB servers until the network is back.

Plan 3
Leave as is, and let the gateways and agents queue data until network 
connectivity returns.


Also, what happens to a gateway's or agent's queue when it can't connect to 
its management server or gateway? Does the queue fill up to a certain size, or 
until the disk is full?

Kind regards
Gareth Miles

KH - Agents will queue until their queue is full, then will drop data FIFO 
(first in, first out) based on a prioritization. We dump perf data first, and 
alerts last.
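To make that drop order concrete, here is a small Python model of a bounded queue that evicts by priority once it is full. It is an illustrative sketch, not SCOM's actual implementation: capacity is counted in items rather than bytes, and the `AgentQueue`/`Item` names and the middle "event" rank are hypothetical. Only the perf-first, alerts-last ordering comes from the reply above.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical drop ranks: lower rank is discarded first.
# Perf first and alerts last come from the thread; the "event" rank is assumed.
DROP_ORDER = {"perf": 0, "event": 1, "alert": 2}


@dataclass
class Item:
    kind: str  # "perf", "event" or "alert"
    seq: int   # arrival order, used to find the oldest item


@dataclass
class AgentQueue:
    capacity: int  # assumption: counted in items, not bytes
    items: List[Item] = field(default_factory=list)
    dropped: List[Item] = field(default_factory=list)

    def enqueue(self, item: Item) -> None:
        self.items.append(item)
        if len(self.items) > self.capacity:
            # Evict the oldest item of the lowest-priority kind present.
            victim = min(self.items, key=lambda i: (DROP_ORDER[i.kind], i.seq))
            self.items.remove(victim)
            self.dropped.append(victim)
```

For example, with a capacity of 3 and arrivals alert, perf, perf, alert, perf, the two older perf items are discarded and both alerts survive.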

Honestly, your choice of action is largely irrelevant. If the outage is 
network only, then normally you want the agents to queue and write alerts to 
their queues so you don't miss anything. However, you might see additional 
alerts from agents because of the network outage impacting applications, so 
this will result in a large number of alerts that won't be "actionable". So 
placing them into MM or not is a judgement call.

Shutting down the gateways and management servers is largely irrelevant. If 
they queue, they will fill their queues, then cut off any further data from 
downstream health services until the queue can clear.
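The back-pressure behaviour described here can be modelled the same way. Again, this is a hypothetical sketch rather than SCOM internals (the `Gateway` class, item-count capacity and batch draining are assumptions): while the upstream link is down the gateway's queue fills, and once it is full, submissions from downstream agents are refused, so that data stays queued on the agents until the gateway can drain.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class Gateway:
    capacity: int                     # assumption: counted in items
    upstream_connected: bool = False  # False while the network is down
    queue: Deque[str] = field(default_factory=deque)

    def accepting(self) -> bool:
        # Stop accepting downstream data while the queue is full.
        return len(self.queue) < self.capacity

    def submit(self, agent: str, payload: str) -> bool:
        """Return True if accepted; False means the agent keeps it queued locally."""
        if not self.accepting():
            return False
        self.queue.append(f"{agent}:{payload}")
        return True

    def drain(self, batch: int) -> int:
        """Forward up to `batch` items upstream; return how many were sent."""
        if not self.upstream_connected:
            return 0
        sent = 0
        while self.queue and sent < batch:
            self.queue.popleft()
            sent += 1
        return sent
```

Once connectivity returns and the gateway drains even one item, it starts accepting downstream submissions again.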

Probably the biggest thing I would want to do is to ensure you place the 
agents' Health Service Watcher objects into MM, because you don't really want a 
ton of "computer down" alerts when you know you have a planned network outage.
