Unplanning network maintenance/outage

2013-03-17 Thread Derek Atkins
Good morning, GnuCashers,

Some (many?) of you may have noticed the outage of 'code.gnucash.org'
starting with a lot of packet loss on Thursday and escalating into a
complete outage by Friday.  This took out our Subversion, Wiki, Email
List, everything server.  Well, as of 2:15pm US/EDT on Saturday
(yesterday) everything should be back to normal and operational.  If you
don't want to hear the gory details of what happened feel free to stop
reading now.

The issue was multiple simultaneous failures of multiple pieces of
equipment.  What I thought was a power outage turned out to be caused by a
failure in my main network switch.  It started dropping ports, or
causing ports to fail partially (dropping packets).  This was also the
main cause of the packet loss.  However, I didn't discover this
until later.

My main DHCP server was off the net; I swapped ethernet cables and it
appeared to fix the problem.

My main database server, however, lost its main network controller so I
had to install a new one (I have a few on hand, so it was a relatively
painless operation -- I just had to remember the magic voodoo to get the
system to call the new card 'eth0', but that was also only a few
minutes).

It was only after I got this working that I realized that it was the
switch that had failed -- many of the ports connected to actual hosts
had a 'dead link'.  I also noticed that my main DHCP server was
bouncing.  It would come on the net, stay for a bit, and then go dark.
Luckily I also had a few extra (smaller) switches lying around so I
linked a few of them together and moved all the non-working ports over.
This also fixed the bouncing DHCP server.

Last, but not least, the VM Server Host's network was wedged, requiring
a complete reboot to reset.  This also required resetting all the VMs,
some of which required a bit of hand-holding to come back (and many of
which required a virtual disk fsck as well, taking even more time).  The
last of the systems returned to service shortly after 2pm.

I do plan to acquire a new switch to replace the failing one, but what I
have now is working so I'll watch it closely for now.

Thanks,

-derek

-- 
   Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
   Member, MIT Student Information Processing Board  (SIPB)
   URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
   warl...@mit.edu                        PGP key available
___
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel


Re: Unplanning network maintenance/outage

2013-03-17 Thread Ted Creedon
Do you need a UPS?

Sounds like a power related problem

tedc


Re: Unplanning network maintenance/outage

2013-03-17 Thread Derek Atkins
I have an AC-DC-AC UPS. I am fairly sure it is not a power-related problem.
The switch is old and has already burned through one power supply. I think it
just got too old and tired.  I think it burned out the network card, too,
possibly in its flailing.  I think it all relates to the switch.

-derek


- Reply message -
From: Ted Creedon tcree...@easystreet.net
To: Derek Atkins warl...@mit.edu
Cc: gnucash-annou...@gnucash.org, gnucash-devel@gnucash.org, 
gnucash-u...@gnucash.org
Subject: Unplanning network maintenance/outage
Date: Sun, Mar 17, 2013 11:56 AM
Do you need a UPS?

Sounds like a power related problem

tedc