Re: [OmniOS-discuss] OmniOS DOS'd my entire network

2017-05-09 Thread Dan McDonald

> On May 9, 2017, at 4:40 PM, Schweiss, Chip  wrote:
> 
> Here's the screen shot:



Interesting.

So notice that the IP address in question is 10.28.17.29 (uggh, the leading-0 
is a Mentat-ism we need to fix in illumos-gate already).  And notice that the 
other node's MAC is 0c:c4:7a:66:a0:ad?  You should find out which node that 
MAC belongs to.
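Finding the owner of a MAC usually means searching switch MAC-address tables and host ARP caches, which print addresses in different notations (colon-separated with or without leading zeros, or dotted groups of four hex digits). Here is a minimal Python sketch that normalizes those notations so they can be compared directly. It assumes only the notations listed, and the helper names are hypothetical. The first three octets are the vendor OUI, which can hint at the hardware maker.

```python
import re


def normalize_mac(text):
    """Return a MAC address as 12 lowercase hex digits.

    Accepts colon- or hyphen-separated octets with or without leading
    zeros (e.g. "c:c4:7a:66:a0:ad") and dotted groups of four hex digits
    (e.g. "0cc4.7a66.a0ad"). Raises ValueError for anything else.
    """
    s = text.strip().lower()
    if "." in s:
        groups = s.split(".")
        if len(groups) != 3 or not all(re.fullmatch(r"[0-9a-f]{4}", g) for g in groups):
            raise ValueError(f"unrecognized MAC: {text!r}")
        return "".join(groups)
    octets = re.split(r"[:-]", s)
    if len(octets) != 6 or not all(re.fullmatch(r"[0-9a-f]{1,2}", o) for o in octets):
        raise ValueError(f"unrecognized MAC: {text!r}")
    # Pad single-digit octets so "c" and "0c" compare equal.
    return "".join(o.zfill(2) for o in octets)


def oui(mac):
    """Vendor prefix (first three octets) in colon-separated form."""
    digits = normalize_mac(mac)
    return ":".join(digits[i:i + 2] for i in range(0, 6, 2))
```

For example, normalize_mac("c:c4:7a:66:a0:ad") and normalize_mac("0cc4.7a66.a0ad") both return "0cc47a66a0ad", and oui() of either returns "0c:c4:7a".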

Dan

___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss


Re: [OmniOS-discuss] OmniOS DOS'd my entire network

2017-05-09 Thread Dan McDonald

> On May 9, 2017, at 3:32 PM, Schweiss, Chip  wrote:
> 
> This was a first for me and extremely painful to locate.
> 
> In the middle of the night between last Friday and Saturday, I started 
> getting down alerts from most of my network.  It took four engineers, 
> including myself, nine hours to pinpoint the source of the problem.
> 
> The problem turned out to be one of my OmniOS boxes constantly sending pure 
> garbage at layer 2 out of the 10G network ports.  This disrupted ARP caches 
> on every machine on every VLAN that was trunked on these ports, not just the 
> VLANs that were configured on the server.  The switches reported every port 
> as healthy and error-free.  The traffic on the bad port was not high, either, 
> just severely disruptive.

Whoa!  On L2 (like non-TCP/IP Ethernet frames)?

> The affected OmniOS box appeared to be healthy, as it was still serving the 
> VM data stores for over 350 virtual machines.  However, like every other 
> service on the network, it appeared to go up and down repeatedly, but NFS 
> kept recovering gracefully.
> 
> The only thing that finally identified this server was when one of us 
> plugged a monitor into the console and saw "WARNING: proxy ARP problem?" 
> scrolling by so fast that we had to take a high-frame-rate cellphone picture 
> to read it.  Powering off this server cleared the problem for the entire 
> network, and its pools were taken over by its HA sister.

If it's easy to do so, unplug or "ifconfig down" the interface next time this 
happens.

> Googling for that warning brings up nothing useful.
> 
> Has anyone ever seen a problem like this?   How did you locate it?

You should search src.illumos.org; you'll find this:


http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/inet/ip/ip_arp.c#1449

We appear to be freaking out over another node having our IP.  The only caller 
that passes AR_CN_BOGON does so after ip_nce_resolve_all() returns AR_BOGON.

I wonder if some other entity had the same IP, and the two fed back on each 
other negatively?
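One way to test that hypothesis is to collect sender IP/MAC pairs from ARP traffic on the affected VLANs (for example with snoop or tcpdump; the capture step is not shown) and flag any IP claimed by more than one MAC. Here is a minimal Python sketch, assuming the pairs are already extracted and the MACs normalized to a single notation; the function name is hypothetical.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: sorted list of MACs} for IPs seen with more than one MAC.

    observations: iterable of (ip, mac) pairs, e.g. the sender IP and
    sender MAC fields of captured ARP packets. MACs are assumed to be
    normalized already, so one NIC never appears under two spellings.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}
```

Legitimate address moves, such as an HA partner taking over a service IP, also show up here, so a hit is a lead rather than proof.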

The message you cite should show an IP address with it:

"proxy ARP problem?  Node '%s' is using %s on %s",

where the %s placeholders are the MAC address, IP address, and interface name, 
respectively.  You didn't get examples with your digital camera, did you?
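If the warnings also reach the system log (on illumos usually /var/adm/messages), the MAC/IP/interface triples could be pulled out and counted instead of photographed. Here is a minimal Python sketch. It assumes the message body matches the format string quoted above, with the fields in MAC, IP, interface order, and that any syslog or console prefix in front of it can be ignored. The function name is hypothetical.

```python
import re
from collections import Counter

# Built from the format string quoted in the thread:
#   "proxy ARP problem?  Node '%s' is using %s on %s"
# Assumption: fields are MAC, IP and interface, in that order, and any
# prefix before the message body (timestamp, "WARNING: ") is ignored.
PROXY_ARP_RE = re.compile(
    r"proxy ARP problem\?\s+Node '(?P<mac>[^']+)' is using "
    r"(?P<ip>\S+) on (?P<ifname>\S+)"
)


def tally_proxy_arp(lines):
    """Count (mac, ip, ifname) triples seen in proxy-ARP warnings."""
    counts = Counter()
    for line in lines:
        match = PROXY_ARP_RE.search(line)
        if match:
            counts[(match["mac"], match["ip"], match["ifname"])] += 1
    return counts
```

For example, passing an open log file, tally_proxy_arp(open("/var/adm/messages")).most_common(5), lists the most frequent offenders first.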

Dan



[OmniOS-discuss] OmniOS DOS'd my entire network

2017-05-09 Thread Schweiss, Chip
This was a first for me and extremely painful to locate.

In the middle of the night between last Friday and Saturday, I started
getting down alerts from most of my network.  It took four engineers,
including myself, nine hours to pinpoint the source of the problem.

The problem turned out to be one of my OmniOS boxes constantly sending pure
garbage at layer 2 out of the 10G network ports.  This disrupted ARP caches
on every machine on every VLAN that was trunked on these ports, not just the
VLANs that were configured on the server.  The switches reported every port
as healthy and error-free.  The traffic on the bad port was not high, either,
just severely disruptive.

The affected OmniOS box appeared to be healthy, as it was still serving the
VM data stores for over 350 virtual machines.  However, like every other
service on the network, it appeared to go up and down repeatedly, but NFS
kept recovering gracefully.

The only thing that finally identified this server was when one of us
plugged a monitor into the console and saw "WARNING: proxy ARP problem?"
scrolling by so fast that we had to take a high-frame-rate cellphone picture
to read it.  Powering off this server cleared the problem for the entire
network, and its pools were taken over by its HA sister.

Googling for that warning brings up nothing useful.

Has anyone ever seen a problem like this?   How did you locate it?

-Chip


Re: [OmniOS-discuss] LX: real ksh93 broken

2017-05-09 Thread Ludovic Orban
Don't worry too much about it; I have a workaround in place that suits me
well.

I just wanted to report this problem.

Thanks,
Ludovic

On Tue, May 9, 2017 at 5:05 PM, Dan McDonald  wrote:

>
> > On May 9, 2017, at 10:30 AM, Dan McDonald  wrote:
> >
> > When I get home I'll provide more details, but you should try bloody or
> > the still-in-beta r151022.
> >
>
> Well shoot.  It appears I have pretty much the same problem in OmniOS
> bloody (and therefore r151022):
>
> root@ubuntu-14-04-b:~# ksh93
> # echo "Hello, world."
> ^D
> ^C
> # ls /
> bin   dev  home  lib64  mnt opt   root  sbin  sys tmp  var
> boot  etc  lib   media  native  proc  run   srv   system  usr
>
> ^D
> ^C#
> #
>
> Any builtin command I type just stops.  Any command that forks produces
> output, but doesn't seem to exit properly.
>
> Ouch... and when I try it again from scratch, it hangs per the original
> report.  And worse, ONE TIME (can't reproduce) it seemed to run away,
> spawning itself.  Apparently if the first thing you type is a builtin or
> just RETURN, you enter a state where you can amp-off (background with &)
> processes, but the shell itself seems to wedge until you hit ^C.  If you
> enter a real process right off the bat, ksh93 goes on a slow forking
> spree... well, sometimes.  Other times, hitting ^C in the hung shell will
> return you.
>
> I wonder if I mismerged or forgot to merge something from SmartOS?
>
> I also wonder if this mismerge or omission happened early on in the prior
> bloody cycle.  It's going to be hard to tell.
>
> Whatever ksh93 is doing, it appears to be doing it in LX userspace, too:
>
> bloody(~)[0]% ptree `pgrep ksh93`
> 493   /usr/sbin/sshd
>   9511  /usr/sbin/sshd -R
>     9515  /usr/sbin/sshd -R
>       9516  -tcsh
>         9555  sudo zlogin lx1
>           9556  zlogin lx1
>             9557  /bin/login -h zone:global -f root
>               9566  -bash
>                 9622  ksh93
> bloody(~)[0]% sudo mdb -k
> Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix
> scsi_vhci zfs sata sd ip hook neti sockfs arp usba xhci stmf stmf_sbd mm
> lofs random crypto idm nfs cpc ufs logindmux ptm ipc ]
> > ::ps !grep ksh
> R   9622   9566   9622   9557  0 0x4a014000 ff1983ee5000 ksh93
> > ff1983ee5000::ps -t
> S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
> R   9622   9566   9622   9557  0 0x4a014000 ff1983ee5000 ksh93
> T  0xff1955982c60 
> > 0xff1955982c60::findstack
> stack pointer for thread ff1955982c60: ff007ae157f0
>   ff007ae15f10 0x20()
> >
>
> And I've no good way to know what it's doing, as the illumos-native tools
> aren't giving me enough data.
>
> Sorry,
> Dan
>


Re: [OmniOS-discuss] LX: real ksh93 broken

2017-05-09 Thread Dan McDonald

> On May 9, 2017, at 10:30 AM, Dan McDonald  wrote:
> 
> When I get home I'll provide more details, but you should try bloody or the 
> still-in-beta r151022.
> 

Well shoot.  It appears I have pretty much the same problem in OmniOS bloody 
(and therefore r151022):

root@ubuntu-14-04-b:~# ksh93
# echo "Hello, world."
^D
^C
# ls /
bin   dev  home  lib64  mnt opt   root  sbin  sys tmp  var
boot  etc  lib   media  native  proc  run   srv   system  usr

^D
^C# 
# 

Any builtin command I type just stops.  Any command that forks produces output, 
but doesn't seem to exit properly.

Ouch... and when I try it again from scratch, it hangs per the original report.  
And worse, ONE TIME (can't reproduce) it seemed to run away, spawning itself.  
Apparently if the first thing you type is a builtin or just RETURN, you enter a 
state where you can amp-off (background with &) processes, but the shell itself 
seems to wedge until you hit ^C.  If you enter a real process right off the bat, 
ksh93 goes on a slow forking spree... well, sometimes.  Other times, hitting ^C 
in the hung shell will return you.

I wonder if I mismerged or forgot to merge something from SmartOS?

I also wonder if this mismerge or omission happened early on in the prior 
bloody cycle.  It's going to be hard to tell.
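A scripted reproducer with a timeout could make that easier to tell: run it against each candidate build and record whether the shell exits. Here is a minimal Python sketch; the helper name is hypothetical. It feeds the shell through a pipe rather than a terminal, so it only helps if the hang also reproduces non-interactively, which this thread does not establish.

```python
import subprocess


def exits_within(argv, stdin_text, timeout_s):
    """Return True if argv exits within timeout_s seconds, else False.

    stdin_text is fed to the child on a pipe (not a terminal), and its
    output is discarded. On timeout, subprocess.run kills the direct
    child before raising TimeoutExpired; grandchildren are not killed.
    """
    try:
        subprocess.run(
            argv,
            input=stdin_text,
            text=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return True
```

For example, exits_within(["/bin/ksh93"], "echo hello\nexit\n", 10) run inside the LX zone. Any processes the shell forked before the timeout may need to be cleaned up separately.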

Whatever ksh93 is doing, it appears to be doing it in LX userspace, too:

bloody(~)[0]% ptree `pgrep ksh93`
493   /usr/sbin/sshd
  9511  /usr/sbin/sshd -R
    9515  /usr/sbin/sshd -R
      9516  -tcsh
        9555  sudo zlogin lx1
          9556  zlogin lx1
            9557  /bin/login -h zone:global -f root
              9566  -bash
                9622  ksh93
bloody(~)[0]% sudo mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix 
scsi_vhci zfs sata sd ip hook neti sockfs arp usba xhci stmf stmf_sbd mm lofs 
random crypto idm nfs cpc ufs logindmux ptm ipc ]
> ::ps !grep ksh
R   9622   9566   9622   9557  0 0x4a014000 ff1983ee5000 ksh93
> ff1983ee5000::ps -t
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R   9622   9566   9622   9557  0 0x4a014000 ff1983ee5000 ksh93
T  0xff1955982c60 
> 0xff1955982c60::findstack
stack pointer for thread ff1955982c60: ff007ae157f0
  ff007ae15f10 0x20()
> 

And I've no good way to know what it's doing, as the illumos-native tools 
aren't giving me enough data.

Sorry,
Dan



Re: [OmniOS-discuss] LX: real ksh93 broken

2017-05-09 Thread Dan McDonald
When I get home I'll provide more details, but you should try bloody or the 
still-in-beta r151022.

Dan

Sent from my iPhone (typos, autocorrect, and all)

> On May 9, 2017, at 10:02 AM, Nahum Shalman  wrote:
> 
> As a data point, I tested this on a very recent SmartOS and was unable to 
> reproduce:
> 
> root@ksh-debian:~# set -o xtrace ; /native/usr/bin/uname -a; uname -a ; cat /etc/debian_version ; dpkg-query -l ksh | grep ksh ; /bin/ksh93
> + set -o xtrace
> + /native/usr/bin/uname -a
> SunOS ksh-debian 5.11 joyent_20170427T222500Z i86pc i386 i86pc
> + uname -a
> Linux ksh-debian 3.16.10 BrandZ virtual linux x86_64 GNU/Linux
> + cat /etc/debian_version
> 8.8
> + dpkg-query -l ksh
> + grep ksh
> ii  ksh  93u+20120801-1  amd64  Real, AT&T version of the Korn shell
> + /bin/ksh93
> # ps | grep $$
> 77074 pts/2    00:00:00 ksh93
> # exit
> root@ksh-debian:~#
> 
>> On Tue, May 9, 2017 at 5:52 AM, Ludovic Orban  wrote:
>> Hi,
>> 
>> I've installed the real ksh93 (this stuff: 
>> https://packages.debian.org/jessie/ksh) on an r151020 LX zone running the 
>> latest Debian 
>> (https://images.joyent.com/images/e74a9cd0-f2d0-11e6-8b69-b3acf2ef87f7) and 
>> it behaves strangely.
>> 
>> Here is what I get:
>> 
>> root@debian-8:~# /bin/ksh93
>> # ls /
>> 
>> 
>> ^C^C
>> ^Z^Z^Z
>> ^\^\^\^\
>> Killed
>> root@debian-8:~#
>> root@debian-8:~#
>> 
>> (note: the "Killed" line is due to a kill -9 from another terminal).
>> 
>> Basically, any command I type just hangs there. Sometimes the shell reacts 
>> to ^C and/or ^D, allowing me to try a different command that also hangs, 
>> and sometimes it doesn't and I have to kill the shell from another terminal. 
>> In the latter case, ksh93 starts a kind of "fork bomb" where it seems to 
>> fork itself in a loop.
>> 
>> I can reproduce this problem from a zlogin terminal as well as from a direct 
>> ssh session. I've also tried ksh93 in a CentOS zone to make sure the problem 
>> wasn't caused by Debian's build, and exactly the same problem arises.
>> 
>> Any idea what's going on?
>> 
>> Thanks,
>> Ludovic
>> 


Re: [OmniOS-discuss] LX: real ksh93 broken

2017-05-09 Thread Nahum Shalman
As a data point, I tested this on a very recent SmartOS and was unable to
reproduce:

root@ksh-debian:~# set -o xtrace ; /native/usr/bin/uname -a; uname -a ; cat /etc/debian_version ; dpkg-query -l ksh | grep ksh ; /bin/ksh93
+ set -o xtrace
+ /native/usr/bin/uname -a
SunOS ksh-debian 5.11 joyent_20170427T222500Z i86pc i386 i86pc
+ uname -a
Linux ksh-debian 3.16.10 BrandZ virtual linux x86_64 GNU/Linux
+ cat /etc/debian_version
8.8
+ dpkg-query -l ksh
+ grep ksh
ii  ksh  93u+20120801-1  amd64  Real, AT&T version of the Korn shell
+ /bin/ksh93
# ps | grep $$
77074 pts/2    00:00:00 ksh93
# exit
root@ksh-debian:~#

On Tue, May 9, 2017 at 5:52 AM, Ludovic Orban  wrote:

> Hi,
>
> I've installed the real ksh93 (this stuff:
> https://packages.debian.org/jessie/ksh) on an r151020 LX zone running the
> latest Debian
> (https://images.joyent.com/images/e74a9cd0-f2d0-11e6-8b69-b3acf2ef87f7)
> and it behaves strangely.
>
> Here is what I get:
>
> root@debian-8:~# /bin/ksh93
> # ls /
>
>
> ^C^C
> ^Z^Z^Z
> ^\^\^\^\
> Killed
> root@debian-8:~#
> root@debian-8:~#
>
> (note: the "Killed" line is due to a kill -9 from another terminal).
>
> Basically, any command I type just hangs there. Sometimes the shell reacts
> to ^C and/or ^D, allowing me to try a different command that also hangs,
> and sometimes it doesn't and I have to kill the shell from another
> terminal. In the latter case, ksh93 starts a kind of "fork bomb" where it
> seems to fork itself in a loop.
>
> I can reproduce this problem from a zlogin terminal as well as from a direct
> ssh session. I've also tried ksh93 in a CentOS zone to make sure the
> problem wasn't caused by Debian's build, and exactly the same problem arises.
>
> Any idea what's going on?
>
> Thanks,
> Ludovic
>


[OmniOS-discuss] LX: real ksh93 broken

2017-05-09 Thread Ludovic Orban
Hi,

I've installed the real ksh93 (this stuff:
https://packages.debian.org/jessie/ksh) on an r151020 LX zone running the
latest Debian (
https://images.joyent.com/images/e74a9cd0-f2d0-11e6-8b69-b3acf2ef87f7) and
it behaves strangely.

Here is what I get:

root@debian-8:~# /bin/ksh93
# ls /


^C^C
^Z^Z^Z
^\^\^\^\
Killed
root@debian-8:~#
root@debian-8:~#

(note: the "Killed" line is due to a kill -9 from another terminal).

Basically, any command I type just hangs there. Sometimes the shell reacts
to ^C and/or ^D, allowing me to try a different command that also hangs,
and sometimes it doesn't and I have to kill the shell from another
terminal. In the latter case, ksh93 starts a kind of "fork bomb" where it
seems to fork itself in a loop.
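To tell a wedged shell from the forking loop described here, one option is to count the shell's descendants a few times, some seconds apart: a flat count suggests a hang, a growing one suggests the fork loop. Here is a minimal Python sketch, assuming process-listing text with PID and PPID as the first two columns, such as the output of ps -e -o pid= -o ppid= -o comm=. The function name is hypothetical.

```python
def count_descendants(ps_lines, root_pid):
    """Count all processes descended from root_pid (not counting root_pid).

    ps_lines: text lines whose first two fields are PID and PPID.
    Header lines, blank lines and self-parented entries (such as PID 0
    on some systems) are skipped.
    """
    children = {}
    for line in ps_lines:
        fields = line.split()
        if len(fields) < 2 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue  # header or blank line
        pid, ppid = int(fields[0]), int(fields[1])
        if pid != ppid:
            children.setdefault(ppid, []).append(pid)
    # Iterative walk down the tree; `seen` guards against revisiting a PID.
    seen = set()
    stack = [root_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)
```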

I can reproduce this problem from a zlogin terminal as well as from a direct
ssh session. I've also tried ksh93 in a CentOS zone to make sure the
problem wasn't caused by Debian's build, and exactly the same problem arises.

Any idea what's going on?

Thanks,
Ludovic