Below:

From: [email protected] [mailto:[email protected]] On Behalf Of Gareth Miles
Sent: Friday, May 12, 2017 8:40 AM
To: [email protected]
Subject: [msmom] SCOM Network outage

Hi

Our company is having an emergency network outage next weekend for 6 hours, possibly longer. I have a SCOM 2012 SP1 management group with 6 management servers in our office which will be affected, and 13 gateway servers around the world which connect to 3 of the management servers, with the agent count fairly evenly distributed among the three. The site with the largest agent count has around 750 agents, with two gateways and the agents split between them. The other gateways have between 200 and 400 agents connecting to them.

During the network outage the gateways will not be able to connect to the management servers, and the management servers will lose connection to the OperationsManager and Data Warehouse DB servers.

I have three plans in mind, but I'm not sure which is the best of the three, or if there's a cleaner way of managing the outage. Any advice would be appreciated.

Plan 1
Put all agents into maintenance mode at the Windows Computer level before the network outage, so only discoveries are processed.

KH - That is an incorrect assumption. When you place a Windows Computer into MM, EVERYTHING unloads. Discoveries are no different from rules or monitors in this regard.

When the network outage occurs, the gateways and the agents will queue the discovery data until network connectivity returns.

KH - No, there will be nothing to queue. If the agents go into MM, they will unload the workflows and send nothing across the wire.

Plan 2
Put all agents into maintenance mode, then shut down the management servers and DB servers until the network is back.

Plan 3
Leave as is, and let the gateways and agents queue data until network connectivity returns.

Also, what is the process for a gateway's or agent's queue when it can't connect to its management server or gateway? Does the queue fill up to a certain size, or until the disk is full?

Kind regards
Gareth Miles

KH - Agents will queue until their queue is full, then will drop data FIFO (first in, first out) based on a prioritization.
We dump perf data first, and alerts last.

Honestly, your choice of action is largely irrelevant. If the outage is network only, then normally you want the agents to queue and write alerts to their queues so you don't miss anything. However, you might see additional alerts from agents because the network outage is impacting applications, which will result in a large number of alerts that won't be "actionable". So placing them into MM or not is a judgement call.

Shutting down the gateways and management servers is largely irrelevant. If they queue, they will fill the queue, then cut off any more downstream health services until the queue can clear.

Probably the biggest thing I would want to do is ensure you place the agents' Health Service Watcher objects into MM, because you don't really want a ton of "computer down" alerts when you know you have a planned network outage.
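
To make the queue behaviour described above concrete (fill up, then discard perf data first and alerts last, oldest first within each type), here is a toy model in Python. It is only a sketch under assumptions, not the actual health service implementation: the class name BoundedAgentQueue, the "event" data type in the middle of the eviction order, and the use of an item count instead of a size limit are all hypothetical choices made for illustration.

```python
from collections import deque


class BoundedAgentQueue:
    """Toy model of a bounded queue that evicts by data type, then by age."""

    # Assumed eviction order: types listed first are discarded first.
    # Only "perf first, alerts last" comes from the thread; "event" is a
    # hypothetical middle tier added for illustration.
    EVICTION_ORDER = ("perf", "event", "alert")

    def __init__(self, max_items):
        # Hypothetical limit: counts items rather than bytes for simplicity.
        self.max_items = max_items
        self._queues = {kind: deque() for kind in self.EVICTION_ORDER}
        self._seq = 0  # arrival counter, used to restore arrival order

    def __len__(self):
        return sum(len(q) for q in self._queues.values())

    def put(self, kind, payload):
        """Queue one item, evicting lower-priority data if over the limit."""
        if kind not in self._queues:
            raise ValueError(f"unknown data type: {kind}")
        self._queues[kind].append((self._seq, payload))
        self._seq += 1
        while len(self) > self.max_items:
            self._evict_one()

    def _evict_one(self):
        # Drop the oldest item of the first non-empty type in eviction order.
        for kind in self.EVICTION_ORDER:
            if self._queues[kind]:
                self._queues[kind].popleft()
                return

    def drain(self):
        """Return the surviving items in arrival order and empty the queue."""
        items = [
            (seq, kind, payload)
            for kind, q in self._queues.items()
            for seq, payload in q
        ]
        for q in self._queues.values():
            q.clear()
        return [(kind, payload) for _, kind, payload in sorted(items)]


if __name__ == "__main__":
    q = BoundedAgentQueue(max_items=3)
    q.put("perf", "cpu=10")
    q.put("alert", "disk full")
    q.put("perf", "cpu=20")
    q.put("alert", "service stopped")  # over the limit: oldest perf item goes
    print(q.drain())
```

In this model a long outage gradually costs you performance samples, while alerts are only lost once nothing else is left to discard, which is why leaving the agents to queue is usually acceptable for a network-only outage.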
