At 9:17 PM +0000 8/20/02, Tribavan Raina wrote:
>Hi all,
>
>
>I have a customer who is using a Cat 4006 as a backbone switch. The customer
>needs answers to some questions regarding the failure of this switch.
>
>4) What are the risks and likelihood of component or delivery failure?
>
>4) In what ways can we i) reduce the likelihood of failure in the first
>instance?
>
>
>Hope some of you experienced guys can help me out. Does Cisco provide MTBF
>stats about the devices or not?
>
>It is urgent, please help me out.
>
In designing high-availability systems, I haven't found MTBF to be
terribly useful. MTBF is useful when you are dealing with a large
number of identical systems (e.g., brakes on a car), but when you have
only one or two devices, the sample is too small to depend on.
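
To make the sample-size point concrete, here is a rough sketch. It
assumes a constant failure rate of 1/MTBF per device, which is my
simplifying assumption, and the MTBF figure is made up for
illustration; it is not a Cisco number.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def expected_failures(mtbf_hours: float, units: int,
                      hours: float = HOURS_PER_YEAR) -> float:
    """Expected failures across `units` devices over `hours`.

    Assumes each device fails at a constant rate of 1/MTBF
    (a simplifying assumption, not a vendor statement).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return units * hours / mtbf_hours


if __name__ == "__main__":
    mtbf = 100_000.0  # hypothetical MTBF in hours
    for units in (1, 2, 1000):
        print(f"{units:5d} unit(s): "
              f"{expected_failures(mtbf, units):.4f} expected failures/year")
```

With 1000 units, the expected 87.6 failures a year is a number you can
plan spares around. With one unit, about 0.09 failures a year tells
you almost nothing about when your particular box will die.
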
What I find much more useful is MTTR. Discuss with your customer what
the cost of downtime would be, and then consider the amount of
downtime that would result from various failure modes. Let's say the
supervisor board fails. Does the customer have people available 24/7
who are qualified to replace it? Does the customer keep a spare?
If it will take 24 hours to get a replacement, multiply the hourly
cost of downtime by 24.
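
The arithmetic above can be sketched as follows. The hourly cost and
the two-hour repair time are hypothetical figures chosen only for
illustration; the 24-hour case is the one from the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One way the switch can fail, with an estimated time to restore."""
    name: str
    hours_to_restore: float


def downtime_cost(hourly_cost: float, hours_to_restore: float) -> float:
    """Cost of one outage: hourly cost of downtime times hours down."""
    return hourly_cost * hours_to_restore


if __name__ == "__main__":
    hourly_cost = 5_000.0  # hypothetical cost of downtime per hour
    modes = [
        # Hypothetical: spare on the shelf and a qualified tech on site.
        FailureMode("supervisor fails, spare on site", 2.0),
        # From the text: 24 hours to get a replacement.
        FailureMode("supervisor fails, no spare", 24.0),
    ]
    for mode in modes:
        cost = downtime_cost(hourly_cost, mode.hours_to_restore)
        print(f"{mode.name}: {cost:,.0f}")
```

Comparing the two rows against the price of a spare supervisor (or a
whole backup box) is usually a more productive conversation with the
customer than an MTBF number.
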
Again, when you are dealing with a single component or a small number
of them, the specific subcomponent that will fail is fairly
unpredictable. It's often simpler to have a backup box than to guess
which spares you will need and keep hardware-qualified technicians
always available.

In considering reliability, also consider such things as maintenance,
including software upgrades. At the moment, I'm designing a
high-availability system for a clinical medical customer, and the
number of devices needed at a critical point is:

    P + B + M

where
    P is the number of devices needed to handle the normal production load,
    B is the number of devices on hot standby (usually 1), and
    M is the number of devices available for maintenance (almost always 1).

If you have several clusters of identical and colocated equipment,
the M devices, but not the B devices, can be shared across some
reasonable number of clusters.
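
A minimal sketch of that count, including the shared maintenance
device. The function names and the `clusters_per_maintenance_pool`
parameter are mine; "some reasonable number of clusters" is a judgment
call, and the default of 4 is only a placeholder.

```python
import math
from typing import Sequence


def devices_per_point(production: int, standby: int = 1,
                      maintenance: int = 1) -> int:
    """P + B + M for a single critical point."""
    return production + standby + maintenance


def devices_for_clusters(production_per_cluster: Sequence[int],
                         standby: int = 1,
                         maintenance: int = 1,
                         clusters_per_maintenance_pool: int = 4) -> int:
    """Total devices when M is shared across clusters but B is not.

    Every cluster keeps its own P and B devices. One set of M devices
    is assumed to serve up to `clusters_per_maintenance_pool`
    colocated clusters (a hypothetical sharing rule).
    """
    if clusters_per_maintenance_pool < 1:
        raise ValueError("clusters_per_maintenance_pool must be >= 1")
    pools = math.ceil(len(production_per_cluster)
                      / clusters_per_maintenance_pool)
    own_devices = sum(p + standby for p in production_per_cluster)
    return own_devices + pools * maintenance


if __name__ == "__main__":
    print(devices_per_point(3))              # one point: 3 + 1 + 1
    print(devices_for_clusters([3, 3, 3]))   # three clusters, shared M
```

For three clusters that each need three production devices, sharing M
gives 13 devices instead of the 15 you would buy if every cluster had
its own maintenance unit.
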
Returning to your question about reducing failure:

1. Good power, filtered and uninterruptible (not just battery backup,
   but physically protected so the janitor doesn't unplug it to use
   the floor scrubber).
2. Proper environmental controls, certainly including temperature and
   humidity, but also local hazards -- vibration, etc. Be sure cooling
   intake and hot-air exhaust can't be blocked by other equipment.
3. Screw down connectors and/or tie-wrap them.
4. Log all changes, preferably to a syslog kept on write-once media
   (e.g., CD-R).
5. Be careful about who has the enable passwords, and change them
   periodically.
6. Always have a TFTP server available.
7. Don't rush to use the latest software release unless you absolutely
   need a new feature or the release fixes a bug.

Message Posted at:
http://www.groupstudy.com/form/read.php?f=7&i=51817&t=51788