Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

2016-05-11 Thread Dale Ghent

> On May 11, 2016, at 12:32 PM, Stephan Budach  wrote:
> I will try to free one node of all services running on it, since I assume I
> will have to reboot the system after changing ixgbe.conf, won't I?
> This is an RSF-1 host, so this will likely be done over the weekend.

You can use dladm on a live system:

dladm set-linkprop -p flowctrl=no ixgbeN

where ixgbeN is each of your ixgbe interfaces (probably ixgbe0 and ixgbe1).
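
As a rough sketch of doing that for several links in one go: the helper name disable_flowctrl is made up for illustration, only the set-linkprop call comes from the command above, and the show-linkprop call is added here just to read the property back afterwards.

```shell
# Sketch: turn off Ethernet flow control on every link named on the command
# line, then print the flowctrl property so the change can be checked.
# disable_flowctrl is a hypothetical helper name, not part of dladm.
disable_flowctrl() {
    for link in "$@"; do
        # Same command as above, applied to one link at a time.
        dladm set-linkprop -p flowctrl=no "$link" || return 1
        # Read the property back for a quick visual check.
        dladm show-linkprop -p flowctrl "$link"
    done
}
```

It would be called as: disable_flowctrl ixgbe0 ixgbe1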

/dale



___
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss


Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

2016-05-11 Thread Stephan Budach

On 11.05.16 at 16:48, Dale Ghent wrote:
> On May 11, 2016, at 7:36 AM, Stephan Budach wrote:
>> On 09.05.16 at 20:43, Dale Ghent wrote:
>>> On May 9, 2016, at 2:04 PM, Stephan Budach wrote:
>>>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>>> On May 9, 2016, at 8:24 AM, Stephan Budach wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have a strange behaviour where OmniOS omnios-r151018-ae3141d will
>>>>>> break the LACP aggr-link on different boxes, when Intel X540-T2s are
>>>>>> involved. It first starts with a couple of link downs/ups on one port
>>>>>> and finally the link on that port negotiates to 1GbE instead of 10GbE,
>>>>>> which then breaks the LACP channel on my Cisco Nexus for this
>>>>>> connection.
>>>>>>
>>>>>> I have tried swapping and interchanging cables and thus switchports,
>>>>>> but to no avail.
>>>>>>
>>>>>> Has anyone else noticed this and, even better, does anyone know a
>>>>>> solution?
>>>>>
>>>>> Was this an issue noticed only with r151018 and not with previous
>>>>> versions, or have you only tried this with 018?
>>>>>
>>>>> By your description, I presume that the two ixgbe physical links will
>>>>> stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>>>
>>>>> /dale
>>>>
>>>> I have noticed that on prior versions of OmniOS as well, but we only
>>>> recently started deploying 10GbE LACP bonds, when we introduced our Nexus
>>>> gear to our network. I will have to check if both links stay at 10GbE
>>>> when not configured as a LACP bond. Let me check that tomorrow and report
>>>> back. As we're heading for a stretched DC, we are mainly configuring
>>>> 2-way LACP bonds over our Nexus gear, so we don't actually have any
>>>> single 10GbE connection, as they will all have to be connected to both
>>>> DCs. This is achieved by using VPCs on our Nexus switches.
>>>
>>> Provide as much detail as you can - if you're using hw flow control,
>>> whether both links act this way at the same time or independently, and so
>>> on. Problems like this often boil down to a very small and seemingly
>>> insignificant detail.
>>>
>>> I currently have ixgbe on the operating table for adding X550 support, so
>>> I can take a look at this; however, I don't have your type of switches
>>> available to me, so LACP-specific testing is something I can't do for you.
>>>
>>> /dale
>>
>> I checked the ixgbe.conf files on each host and they are all still at the
>> standard setting, which includes flow_control = 3;
>
> Ah, so you are using Ethernet flow control. Could you try disabling that on
> both sides (on the ixgbe host and on the switch) and see if that corrects the
> link stability issues? There's an outstanding issue with hw flow control on
> ixgbe that you *might* be running into regarding pause frame timing, which
> could manifest in the way you describe.
>
> /dale

I will try to free one node of all services running on it, since I assume I
will have to reboot the system after changing ixgbe.conf, won't I?

This is an RSF-1 host, so this will likely be done over the weekend.

Thanks,
Stephan


Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

2016-05-11 Thread Dale Ghent

> On May 11, 2016, at 7:36 AM, Stephan Budach  wrote:
> 
> I checked the ixgbe.conf files on each host and they all are still at the 
> standard setting, which includes flow_control = 3;

Ah, so you are using Ethernet flow control. Could you try disabling that on
both sides (on the ixgbe host and on the switch) and see if that corrects the
link stability issues? There's an outstanding issue with hw flow control on
ixgbe that you *might* be running into regarding pause frame timing, which
could manifest in the way you describe.

/dale





Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

2016-05-11 Thread Stephan Budach

On 11.05.16 at 14:50, Stephan Budach wrote:

> Ok, so we can likely rule out LACP as a generic reason for this issue…
> After removing ixgbe0 from aggr1, I plugged it into an unused port
> of my Nexus FEX and lo and behold, here we go:
>
> root@tr1206902:/root# tail -f /var/adm/messages
> May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1000 Mbps, full duplex
> May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
> May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex
>
> May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
> May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex
>
> So, after less than an hour, we had the first link-cycle on ixgbe0,
> albeit on another port, which has no LACP config whatsoever. I will
> monitor this for a while and see if we get more of those.
>
> Thanks,
> Stephan


Ehh… and sorry, I almost forgot to paste the log from the Cisco Nexus switch:

2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-SPEED: Interface Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface Ethernet141/1/9, operational duplex mode changed to Full
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface Ethernet141/1/9, operational Receive Flow Control state changed to off
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface Ethernet141/1/9, operational Transmit Flow Control state changed to on
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_UP: Interface Ethernet141/1/9 is up in mode access
2016 May 11 14:07:29 gh79-nx-01 %ETHPORT-5-IF_DOWN_LINK_FAILURE: Interface Ethernet141/1/9 is down (Link failure)
2016 May 11 14:07:45 gh79-nx-01 last message repeated 1 time
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-SPEED: Interface Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface 

Re: [OmniOS-discuss] sudden reboot

2016-05-11 Thread Dan McDonald
You had a kernel panic. Can you share that vmdump.0 file?

Dan


> On May 11, 2016, at 4:24 AM, Martijn Fennis  wrote:
> 
> Hi,
>  
> I’m experiencing an unexpected reboot.
>  
> System is supermicro with ECC mem, qlogic FC and LSI SAS.
>  
> Temperatures and voltages look OK.
>  
> The message I find is about the Express bus… but how do I find the cause?
> Should I set something like IRQ steering in the BIOS?
>  
> May 11 10:01:09 ZFS01 savecore: [ID 570001 auth.error] reboot after panic: 
> pcieb-0: PCI(-X) Express Fatal Error. (0x101)
> May 11 10:01:09 ZFS01 savecore: [ID 365739 auth.error] Saving compressed 
> system crash dump in /var/crash/unknown/vmdump.0
>  
> Thanks,
>  
> Martijn


Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

2016-05-11 Thread Stephan Budach

On 11.05.16 at 13:36, Stephan Budach wrote:

> I checked the ixgbe.conf files on each host and they are all still at
> the standard setting, which includes flow_control = 3;
> So they all have flow control enabled. As for the Nexus config, all of
> those ports are still configured as standard Ethernet ports and
> modifications have only been made globally to the switch.
>
> I will now have to yank the one port on one of the hosts from the aggr
> and configure it as a standalone port. Then we will see if it still
> gets the disconnects/reconnects and finally the negotiation to
> 1GbE instead of 10GbE. This only seems to happen to the same port; I
> have never seen other ports of the affected aggrs acting up. I also
> thought I noticed that those were always the "same" physical ports,
> that is, the first port on the card (ixgbe0), but that might of course
> be a coincidence.
>
> Thanks,
> Stephan


Ok, so we can likely rule out LACP as a generic reason for this issue…
After removing ixgbe0 from aggr1, I plugged it into an unused port
of my Nexus FEX and lo and behold, here we go:

root@tr1206902:/root# tail -f /var/adm/messages
May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1000 Mbps, full duplex
May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex

May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down
May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1 Mbps, full duplex

So, after less than an hour, we had the first link-cycle on ixgbe0, albeit
on another port, which has no LACP config whatsoever. I will monitor
this for a while and see if we get more of those.
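
To watch for further link cycles without scrolling through the whole file, something like the following could be used. It is only a sketch: link_events is a hypothetical helper name, and it assumes that each mac NOTICE occupies a single line in /var/adm/messages.

```shell
# Sketch: print "timestamp: state" for every link transition of one interface.
# Reads syslog-style text on stdin; link_events is a made-up helper name.
link_events() {
    awk -v nic="$1" '
        {
            # Only look at the mac NOTICE lines for the requested interface.
            p = index($0, "NOTICE: " nic " link ")
            if (p == 0) next
            # The first three fields are the syslog timestamp (month day time).
            ts = $1 " " $2 " " $3
            # Keep everything from "link ..." onward as the state text.
            print ts ": " substr($0, p + length("NOTICE: " nic " "))
        }'
}
```

It would be called as: link_events ixgbe0 < /var/adm/messages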


Thanks,
Stephan


Re: [OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

2016-05-11 Thread Stephan Budach

On 09.05.16 at 20:43, Dale Ghent wrote:

> Provide as much detail as you can - if you're using hw flow control, whether
> both links act this way at the same time or independently, and so on. Problems
> like this often boil down to a very small and seemingly insignificant detail.
>
> I currently have ixgbe on the operating table for adding X550 support, so I
> can take a look at this; however, I don't have your type of switches available
> to me, so LACP-specific testing is something I can't do for you.
>
> /dale

I checked the ixgbe.conf files on each host and they are all still at
the standard setting, which includes flow_control = 3;
So they all have flow control enabled. As for the Nexus config, all of
those ports are still configured as standard Ethernet ports and
modifications have only been made globally to the switch.

I will now have to yank the one port on one of the hosts from the aggr
and configure it as a standalone port. Then we will see if it still
gets the disconnects/reconnects and finally the negotiation to 1GbE
instead of 10GbE. This only seems to happen to the same port; I have never
seen other ports of the affected aggrs acting up. I also thought I noticed
that those were always the "same" physical ports, that is, the
first port on the card (ixgbe0), but that might of course be a coincidence.


Thanks,
Stephan


[OmniOS-discuss] sudden reboot

2016-05-11 Thread Martijn Fennis
Hi,

I’m experiencing an unexpected reboot.

System is supermicro with ECC mem, qlogic FC and LSI SAS.

Temperatures and voltages look OK.

The message I find is about the Express bus… but how do I find the cause? Should
I set something like IRQ steering in the BIOS?

May 11 10:01:09 ZFS01 savecore: [ID 570001 auth.error] reboot after panic: 
pcieb-0: PCI(-X) Express Fatal Error. (0x101)
May 11 10:01:09 ZFS01 savecore: [ID 365739 auth.error] Saving compressed system 
crash dump in /var/crash/unknown/vmdump.0

Thanks,

Martijn