I have to agree that this is all good information.

Your question on ITIL: My personal opinion is that ITIL best practices are 
great to apply to all environments. They make sense, particularly for change 
control.

However, as stated, it's also highly dependent on how many devices are being 
managed/monitored. I come from a NOC managing 8600+ network devices across 190+ 
countries.

We had strict change management policies, windows, and approvers, all 
depending on the times of operation in the different countries.

We were growing so rapidly that we kept purchasing companies and bringing 
over their infrastructure, each time inheriting new ticket systems, etc.

NNM is by far one of my favorite choices for network monitoring. The issue with 
it is really the views and getting them organized in an easily readable fashion.

RT is a great ticketing tool for specific needs. It allows for approvers and 
approval tracking of tickets. However, it isn't extremely robust.

I would recommend something like HP ServiceCenter, since it can integrate and 
automate the alert output directly into tickets. It also gives you the ability 
to use Alarmpoint for automated paging of your on-calls based on their 
schedules, by device, etc.
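
As a rough illustration of that alert-to-ticket-to-page flow, here's a small 
Python sketch. The create_ticket(), send_page(), and on_call_for() calls are 
placeholders I made up for whatever your ticket system and paging tool actually 
expose; they are not HP ServiceCenter or Alarmpoint APIs.

    def handle_alert(alert, create_ticket, send_page, on_call_for):
        # Open a ticket for every alert so nothing lives only in someone's inbox.
        ticket_id = create_ticket(
            summary=f"{alert['device']}: {alert['message']}",
            severity=alert["severity"],
        )
        # Only critical alerts page someone out; the rest wait for the shift.
        if alert["severity"] == "critical":
            send_page(on_call_for(alert["device"]), ticket_id)
        return ticket_id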

Not to say that I'm a complete HP fanboy, but I will say that it works 
extremely well. It's easy to use, and simplicity is the key to fewer mistakes.

All of our equipment was 99% Cisco, so the combination worked extremely well.

Turnover: I firmly believe shift changes should be verbally handed off. Build 
a template for the day's top items or most critical issues. List out the ongoing 
issues and any tickets being carried over, with their status. Allot 15 minutes 
for the team to sit down with the printout and review it.
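
If it helps, here's a minimal Python sketch of generating that printout; the 
fields and sample items are made up for illustration, so swap in whatever your 
ticket system actually carries over.

    from datetime import date

    open_items = [
        {"ticket": "INC-1042", "severity": "critical",
         "summary": "nycgw01 flapping", "status": "vendor engaged"},
        {"ticket": "INC-1051", "severity": "major",
         "summary": "db01 RAID degraded", "status": "disk on order"},
    ]

    def handoff_report(items):
        # Critical issues sort to the top so the incoming shift sees them first.
        lines = ["Shift handoff - " + date.today().isoformat(), "=" * 40]
        for item in sorted(items, key=lambda i: i["severity"] != "critical"):
            lines.append("[%-8s] %s: %s -- %s" % (item["severity"].upper(),
                         item["ticket"], item["summary"], item["status"]))
        return "\n".join(lines)

    print(handoff_report(open_items))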

Contracts/SLAs:
 We placed all of our systems under a blanket 99.999% uptime critical SLA. 
However, that was a mistake on our part, driven by a lack of time to plan well 
when adapting to an ever-changing environment.

It would be best to set up your appliances/hardware in your ticket system and 
monitoring tool based on the SLA you intend to apply to them. Also ensure you 
include all hardware information: supply vendor, support vendor, support 
coverage, ETR from the vendor, and replacement time.
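
The point is to carry the SLA and support details on the record for every 
device. A minimal sketch of such a record (field names and values are purely 
illustrative, not tied to any particular ticket system):

    device = {
        "hostname": "nycgw01",
        "sla": "99.999",                      # the uptime tier you intend to apply
        "supply_vendor": "ExampleSupplyCo",   # made-up name
        "support_vendor": "ExampleSupportCo", # made-up name
        "support_coverage": "24x7x4",
        "vendor_etr_hours": 4,                # estimated time to repair from the vendor
        "replacement_time_hours": 24,
    }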

There are many tools that do automated discovery on your network and monitor 
changes to it. This is key if you have a changing environment. The more devices 
you have, the more difficult it is to pinpoint what a failed router or switch 
ACTUALLY affects upstream or downstream.

If you have the chance, take the opportunity to map your hardware/software 
dependencies. If a switch fails and it provides service to, for example, db01, 
and db01 drives a service in another location, then you should know that the 
failure reaches that far. It's far too common for companies to get so large 
that they have no idea what the impact of one port failure in xyz does to the 
entire infrastructure.
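
Dependency mapping doesn't have to be fancy; even a simple "X provides service 
to Y" graph you can walk gives you the blast radius of a single failure. A 
rough Python sketch, with a made-up topology:

    from collections import deque

    serves = {                      # device/service -> what it provides service to
        "sw-nyc-07": ["db01"],
        "db01": ["billing-app", "reporting-app"],
        "billing-app": [],
        "reporting-app": [],
    }

    def impacted(failed, graph):
        # Walk everything downstream of the failed component.
        seen, queue = set(), deque([failed])
        while queue:
            for dep in graph.get(queue.popleft(), []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return seen

    print(impacted("sw-nyc-07", serves))
    # -> {'db01', 'billing-app', 'reporting-app'}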

Next: Build your monitoring infrastructure completely separate from the rest of 
the network. If you don't do switch redundancy (active/passive) on all of your 
systems or NIC teaming (active/passive), then ensure you do it at least on your 
monitoring systems.

Build your logging out in a PCI/SOX fashion. Ensure you have remote logging on 
everything, with log retention based on your needs, and Tripwire with approved 
reports sent weekly for the systems requiring PCI/SOX monitoring.

Remember, if your monitoring systems go down, your NOC is blind. It's highly 
recommended that the NOC have gateway/jump box systems available to all parts 
of the network. Run the management network entirely on RFC1918 space for 
security.

Ensure all on-calls have access, and use a VPN solution that requires a 
password plus a VPN key generator. Utilize TACACS/LDAP as much as you can. 
Tighten everything. Log everything. I can't say that enough.

Enforce password changes every 89 days, require strong, non-dictionary 
passwords, etc.

Build an internal site, use a wiki-based format, and allow the team the ability 
to add/modify with approval. Build a FAQ/knowledge base. Possibly create a forum 
so your team can post extra tips, notes, and one-offs: anything that may help 
new members or people who run across something in the middle of the night that 
they may have never seen. This keeps you from waking up your lead staff in the 
middle of the night.

On-calls: Always have a primary/secondary with a clear, documented on-call 
procedure.
Example (critical); a rough sketch of these timings follows the list:
1. Issue occurs.
2. Page the on-call within 10 minutes.
3. Allow 10 minutes for a return call.
4. Page again.
5. Allow 5 minutes.
6. Page the secondary.
Etc.
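
Here is that timing logic as a rough Python sketch; page() and acknowledged() 
are placeholders for whatever paging tool and ticket check you actually use.

    import time

    ESCALATION = [            # (who to page, minutes to wait for a callback)
        ("primary", 10),
        ("primary", 5),
        ("secondary", 10),
    ]

    def escalate(issue, page, acknowledged):
        for target, wait_minutes in ESCALATION:
            page(target, issue)
            time.sleep(wait_minutes * 60)
            if acknowledged(issue):
                return target
        return None   # nobody answered; escalate manually to the next tier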

Ensure the staff document every step they take and copy/paste every page they 
send into the ticket system.

Build templated paging formats. Understand that text messages on most carriers 
have hard character limits. Use something like:
Time InitialsOfNOCPerson SystemAlerting Error CallbackNumber

(e.g. 14:05 KH nycgw01 System reports down 555-555-5555 xt103)
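
As a sketch, the formatting and truncation are only a few lines of Python (160 
characters is a common SMS limit, but check your carriers):

    def format_page(ts, initials, system, error, callback, limit=160):
        # Trim the error text rather than the whole string so the callback
        # number always survives the carrier's hard limit.
        overhead = len(ts) + len(initials) + len(system) + len(callback) + 4  # 4 spaces
        error = error[:max(limit - overhead, 0)]
        return " ".join([ts, initials, system, error, callback])

    print(format_page("14:05", "KH", "nycgw01", "System reports down",
                      "555-555-5555 xt103"))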

Use an internal paging website/software or, as mentioned, something like 
Alarmpoint.

There is nothing more frustrating for an on-call than to be paged and have no 
idea who to call back, who paged them, or what the number is.

I've written so much my fingers hurt from these Blackberry keys. Hope this 
information helps a little.

Best of luck,
-Kevin

Excuse the spelling/punctuation... This is from my mobile.
-----Original Message-----
From: Joe Provo <[email protected]>
Date: Sat, 17 Jul 2010 14:56:04 
To: Kasper Adel<[email protected]>
Reply-To: [email protected]
Cc: NANOG list<[email protected]>
Subject: Re: NOC Best Practices

On Fri, Jul 16, 2010 at 09:34:53PM +0300, Kasper Adel wrote:
> Thanks for all the people that replied off list, asking me to send them
> responses I will get.
[snip]
> Which is useful but I am looking for more stuff from the best people that
> run the best NOCs in the world.
> 
> So i'm throwing this out again.
> 
> I am looking for pointers, suggestions, URLs, documents, donations on what a
> professional NOC would have on the below topics:

A lot, as others have said, depending on the business, staffing, 
goals, SLA, contracts, etc.

> 1) Briefly, how they handle their own tickets with vendors or internal

Run a proper ticketing system over which you have control (RT and 
friends, rather than locking yourself into something you have to pay 
for changes to).  Don't judge just by ticket closure rate; judge by 
successfully resolving problems. Encourage folks to use the system for 
tracking projects and keeping notes on work in progress rather than 
private datastores. Inculcate a culture of open exploration to solve 
problems rather than rote memorization. This gets you a large way toward #2.

> 2) How they create a learning environment for their people (Documenting
> Syslog, lessons learned from problems...etc)

Mentoring, shoulder surfing. Keep your senior people in the mix 
of triage & response so they don't get dull, and cross-pollinate 
skills.  When someone is new, have their probationary period be 
shadowing the primary on-call the entire time.  Your third shift 
[or whatever spans your maintenance windows] should be the folks 
who actually wind up executing well-specified maintenances (with 
guidance as needed) and be the breeding ground of some of your 
better hands-on folks.

> 3) Shift to Shift hand over procedures

This will depend on your systems for tickets, logbooks, etc. 
Solve that first and this should become evident.

> 4) Manual tests  they start their day with and what they automate (common
> stuff)

This will vary with the business and what's on-site; I can't 
advise you to always include the genset if you don't have 
one.

> 5) Change management best practices and working with operations/engineering
> when a change will be implemented

Standing maintenance windows (of varying severity if that 
matters to your business), and a clear definition of what needs 
to be done only during those and what can be done anytime 
[hint: policy tuning shouldn't be restricted to them, and 
you shouldn't make it so an urgent thing like a BGP leak 
can't be fixed].  Linear rather than parallel workflows 
for approval, and not too many approval stages, else your 
staff will be spending time trying to get things through 
the administrative stages instead of doing actual work.  Very 
simply, have a standard for specifying what needs to be 
done, the minimal tests needed to verify success, and how 
you fall back if you fail the tests.  If someone can't 
specify it and insists on frobbing around, they likely don't 
understand the problem or the needed work.

Cheers,

Joe
-- 
             RSUC / GweepNet / Spunk / FnB / Usenix / SAGE
