Re: [OmniOS-discuss] OmniOS DOS'd my entire network

2017-05-09 Thread Dan McDonald

> On May 9, 2017, at 4:40 PM, Schweiss, Chip wrote:
> 
> Here's the screen shot:



Interesting.

So notice that the IP address in question is 10.28.17.29 (uggh, the leading-0 
is a Mentat-ism we need to fix in -gate already).  And notice that the other 
node's MAC is 0c:c4:7a:66:a0:ad.  You should find out which node that MAC 
belongs to.
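
One way to chase the MAC down is to search a saved ARP or switch MAC-table 
dump for it.  A minimal sketch, assuming the dump is plain text with one entry 
per line; the sample lines, the second MAC, and the interface name ixgbe0 are 
made up for illustration.  Some tools may print octets without leading zeros, 
so the MAC is normalized before comparing:

```python
import string

def normalize_mac(mac):
    """Return a MAC as six zero-padded lowercase hex octets joined by ':'."""
    parts = mac.strip().lower().replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError("not a MAC address: %r" % mac)
    for p in parts:
        # Each octet must be one or two hex digits.
        if not (1 <= len(p) <= 2) or any(c not in string.hexdigits for c in p):
            raise ValueError("not a MAC address: %r" % mac)
    return ":".join("%02x" % int(p, 16) for p in parts)

def find_mac(lines, wanted):
    """Return the lines that contain the wanted MAC in any whitespace field."""
    wanted = normalize_mac(wanted)
    hits = []
    for line in lines:
        for field in line.split():
            try:
                if normalize_mac(field) == wanted:
                    hits.append(line)
                    break
            except ValueError:
                continue  # field is not a MAC (IP address, interface name, ...)
    return hits

# Hypothetical table dump; only the first MAC and the IP come from this thread.
sample_lines = [
    "ixgbe0 10.28.17.1  255.255.255.255 0:25:90:aa:bb:1",
    "ixgbe0 10.28.17.29 255.255.255.255 c:c4:7a:66:a0:ad",
]

for hit in find_mac(sample_lines, "0c:c4:7a:66:a0:ad"):
    print(hit)
```

The same function works on a switch's MAC address table if that is where the 
dump came from, as long as the MAC is a whitespace-separated field in 
colon or dash notation.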

Dan

___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss


Re: [OmniOS-discuss] OmniOS DOS'd my entire network

2017-05-09 Thread Dan McDonald

> On May 9, 2017, at 3:32 PM, Schweiss, Chip wrote:
> 
> This was a first for me and extremely painful to locate.
> 
> In the middle of the night between last Friday and Saturday, I started 
> getting down alerts from most of my network.   It took 4 engineers including 
> myself 9 hours to pinpoint the source of the problem.
> 
> The problem turned out to be one of my OmniOS boxes sending out pure garbage 
> constantly on layer 2 out the 10G network ports.   This disrupted ARP caches 
> on every machine on every VLAN that was trunked on these ports, not just the 
> VLANs that were configured on the server.   The switches reported every port 
> healthy and without error.   The traffic on the bad port was not high either, 
> just severely disruptive.

Whoa!  On L2 (like non-TCP/IP ethernet frames)?

> The affected OmniOS box appeared to be healthy, as it was still serving the VM 
> data stores for over 350 virtual machines.   However, it, like every other 
> service on the network, appeared to be up and down repeatedly, but NFS kept on 
> recovering gracefully.
> 
> The only thing that finally identified this server was when one of us plugged a 
> monitor into the console and saw "WARNING: proxy ARP problem?"  happening so 
> fast that it took a cellphone picture of it at a high frame rate to read 
> it.   Powering off this server cleared the problem for the entire network, 
> and its pools were taken over by its HA sister.

If it's easy to do so, unplug or "ifconfig down" the interface next time this 
happens.

> Googling for that warning brings up nothing useful.
> 
> Has anyone ever seen a problem like this?   How did you locate it?

You should search src.illumos.org; you'll find this:


http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/inet/ip/ip_arp.c#1449

We appear to be freaking out over another node having our IP.  The only caller 
with AR_CN_BOGON is after ip_nce_resolve_all() returns AR_BOGON.

I wonder if some other entity had the same IP, and they 
fed-back-upon-each-other negatively?

The message you cite should show an IP address with it:

"proxy ARP problem?  Node '%s' is using %s on %s",

where the %s-es are MAC-address, IP-address, and interface-name respectively.  
You didn't get examples with your digital camera, did you?
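
If the warnings can be captured as text (from a console log, say), the three 
fields can be pulled out mechanically.  A rough sketch, assuming the line looks 
like the format string above with a "WARNING: " prefix; the interface name 
ixgbe0 in the sample is made up, and the exact spacing on a real console is an 
assumption:

```python
import re

# Mirrors "proxy ARP problem?  Node '%s' is using %s on %s".
# Whitespace after the '?' is matched loosely because the real console
# spacing is an assumption here.
PROXY_ARP_RE = re.compile(
    r"proxy ARP problem\?\s+Node '(?P<mac>[^']+)' "
    r"is using (?P<ip>\S+) on (?P<ifname>\S+)"
)

def parse_proxy_arp_warning(line):
    """Return (mac, ip, ifname) from a warning line, or None if no match."""
    m = PROXY_ARP_RE.search(line)
    if m is None:
        return None
    return m.group("mac"), m.group("ip"), m.group("ifname")

# Hypothetical sample line; ixgbe0 is an invented interface name.
sample = ("WARNING: proxy ARP problem?  Node '0c:c4:7a:66:a0:ad' "
          "is using 10.28.17.29 on ixgbe0")
print(parse_proxy_arp_warning(sample))
```

Counting the distinct (MAC, IP) pairs over a whole log would show whether one 
node or several are claiming the address.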

Dan



[OmniOS-discuss] OmniOS DOS'd my entire network

2017-05-09 Thread Schweiss, Chip

This was a first for me and extremely painful to locate.

In the middle of the night between last Friday and Saturday, I started
getting down alerts from most of my network.   It took 4 engineers
including myself 9 hours to pinpoint the source of the problem.

The problem turned out to be one of my OmniOS boxes sending out pure
garbage constantly on layer 2 out the 10G network ports.   This disrupted
ARP caches on every machine on every VLAN that was trunked on these ports,
not just the VLANs that were configured on the server.   The switches
reported every port healthy and without error.   The traffic on the bad
port was not high either, just severely disruptive.

The affected OmniOS box appeared to be healthy, as it was still serving the
VM data stores for over 350 virtual machines.   However, it, like every
other service on the network, appeared to be up and down repeatedly, but NFS
kept on recovering gracefully.

The only thing that finally identified this server was when one of us plugged
a monitor into the console and saw "WARNING: proxy ARP problem?"  happening
so fast that it took a cellphone picture of it at a high frame rate to
read it.   Powering off this server cleared the problem for the entire
network, and its pools were taken over by its HA sister.

Googling for that warning brings up nothing useful.

Has anyone ever seen a problem like this?   How did you locate it?

-Chip