NAT machine. Help needed

Svavar Örn Eysteinsson Fri, 09 Oct 2009 06:46:06 -0700

On 8.10.2009, at 23:18, Graham, David wrote:

> Svavar Örn Eysteinsson wrote:
>> I installed the new driver with your procedures below, and issued a
>> "modprobe e1000 TxDescriptorStep=4,4,4".
>> My dmesg logs finally showed the parameters:
> Good. I don't follow a lot of the network manager messages, but most
> importantly, the TxDescriptorStep parameter is applied on each of
> your 3 interfaces, and they all came up.
>> Then I issued a "ethtool -K eth0 tso off"
>> Oct  7 16:43:56 localhost kernel: e1000: eth0: e1000_set_tso: TSO is
>> Disabled
>> Then I asked a colleague of mine to start a fast, large download
>> from one of my FTP servers.
>> As soon as he reached about 10 - 11 MB/s in transfer rate, it looked
>> like the eth0 interface just went down and then came back up again.
>> I get these in my logs :
>> Oct  7 16:55:36 localhost kernel: e1000: eth0: e1000_watchdog_task:
>> NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
>> (I think this comes right after I execute my firewall script)
>



> I'm a bit confused. I can see that the eth0 link goes down, and then  
> up again - but when exactly ?  Is it occurring after you execute the  
> firewall script if the ftp data rate is already > 10MB/s ? Does the  
> link  bounce down & back if the firewall script isn't executed ?  
> Does the link bounce if the firewall script is executed even if  
> there's no traffic ?

I can't test any connections outside or inside my servers without
loading my firewall script, because of the NAT and the rules. So the link
bounces down and back up after I have loaded the script.

This is a very strange problem. The messages below suddenly pop up
about 1-2 minutes after my friend starts the download.
They state that the interface goes down and comes up again, but my
friend is still downloading at the time.

Every connection from my LAN to the outside world works. Then I always get
a phone call from customers who have websites and email on
my NAT'd servers, saying
that they can't access anything. Strange, as traffic from the LAN
works. So I execute my firewall script again, and everything comes back
up as if nothing had happened :S

>
> And are there no longer any "Detected Tx Unit Hang" messages in the  
> log ? If the TX Hang messages are gone, the initial issue may  
> already be resolved.
>
> Knowing the answer to these questions will help me to decide what to  
> do next, and to more closely align my system for reproduction of the  
> issue you are seeing.

The Tx Unit Hang messages no longer appear in my logs, so I think the
TX hang issue is gone.
After I execute my firewall script, I tell my friend to download from
my FTP server.
Then, some minutes later, these messages suddenly pop up in my logs
(note: PUBLIC_IP was replaced):

Oct  9 09:15:54 localhost NetworkManager: <info>  (eth0): carrier now  
OFF (device state 8)
Oct  9 09:15:54 localhost NetworkManager: <info>  (eth0): device state  
change: 8 -> 2
Oct  9 09:15:54 localhost NetworkManager: <info>  (eth0): deactivating  
device (reason: 40).
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_route():  
(eth0) error -34 returned from rtnl_route_del(): Sucess#012
Oct  9 09:15:54 localhost avahi-daemon[2465]: Withdrawing address  
record for PUBLIC_IP on eth0.
Oct  9 09:15:54 localhost avahi-daemon[2465]: Leaving mDNS multicast  
group on interface eth0.IPv4 with address PUBLIC_IP.
Oct  9 09:15:54 localhost avahi-daemon[2465]: Interface eth0.IPv4 no  
longer relevant for mDNS.
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:54 localhost NetworkManager: <WARN>  check_one_address():  
(eth0) error -99 returned from rtnl_addr_delete(): Sucess#012
Oct  9 09:15:55 localhost kernel: e1000: eth0: e1000_watchdog_task:  
NIC Link is Up 100 Mbps Full Duplex, Flow Control: RX/TX
Oct  9 09:15:55 localhost NetworkManager: <info>  (eth0): carrier now  
ON (device state 2)
Oct  9 09:15:55 localhost NetworkManager: <info>  (eth0): device state  
change: 2 -> 3
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
starting connection 'System eth0'
Oct  9 09:15:55 localhost NetworkManager: <info>  (eth0): device state  
change: 3 -> 4
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 1 of 5 (Device Prepare) scheduled...
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 1 of 5 (Device Prepare) started...
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 2 of 5 (Device Configure) scheduled...
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 1 of 5 (Device Prepare) complete.
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 2 of 5 (Device Configure) starting...
Oct  9 09:15:55 localhost NetworkManager: <info>  (eth0): device state  
change: 4 -> 5
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 2 of 5 (Device Configure) successful.
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 3 of 5 (IP Configure Start) scheduled.
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 2 of 5 (Device Configure) complete.
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 3 of 5 (IP Configure Start) started...
Oct  9 09:15:55 localhost NetworkManager: <info>  (eth0): device state  
change: 5 -> 7
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 4 of 5 (IP Configure Get) scheduled...
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 3 of 5 (IP Configure Start) complete.
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 4 of 5 (IP Configure Get) started...
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 5 of 5 (IP Configure Commit) scheduled...
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 4 of 5 (IP Configure Get) complete.
Oct  9 09:15:55 localhost NetworkManager: <info>  Activation (eth0)  
Stage 5 of 5 (IP Configure Commit) started...
Oct  9 09:15:55 localhost avahi-daemon[2465]: Joining mDNS multicast  
group on interface eth0.IPv4 with address PUBLIC_IP.
Oct  9 09:15:55 localhost avahi-daemon[2465]: New relevant interface  
eth0.IPv4 for mDNS.
Oct  9 09:15:55 localhost avahi-daemon[2465]: Registering new address  
record for PUBLIC_IP on eth0.IPv4.
Oct  9 09:15:56 localhost NetworkManager: <info>  (eth0): device state  
change: 7 -> 8
Oct  9 09:15:56 localhost NetworkManager: <info>  Policy set 'System  
eth0' (eth0) as default for routing and DNS.
Oct  9 09:15:56 localhost NetworkManager: <info>  Activation (eth0)  
successful, device activated.
Oct  9 09:15:56 localhost NetworkManager: <info>  Activation (eth0)  
Stage 5 of 5 (IP Configure Commit) complete.


These messages clearly have nothing to do with e1000, but somehow with
the NetworkManager daemon and avahi-daemon.
They state that my interface goes down and comes up. But when it
comes up again, I have to relaunch my firewall script to
load the rules. So I thought that this problem was related to my
public switch. I have now replaced my ProCurve 2524 (which is only
100 Mbit) with a dumb Linksys 24-port 1000 Mbit switch.
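
To see how often the interface really drops, the "carrier now OFF" lines from the log above can be counted per interface. This is only a sketch: the function name count_carrier_drops is made up for illustration, and it assumes the syslog file holds each NetworkManager message on a single line (not wrapped as in this mail).

```shell
# count_carrier_drops LOGFILE IFACE
# Prints the number of NetworkManager "carrier now OFF" lines for IFACE.
# Assumes one log message per line, in the format shown in the log above.
count_carrier_drops() {
    logfile=$1
    iface=$2
    # -E: extended regex, -c: print the count of matching lines
    grep -E -c "\($iface\): carrier now +OFF" "$logfile"
}
```

Running it against the system log before and after a test download (for example "count_carrier_drops /var/log/messages eth0", where the log path is an assumption about the distribution) shows whether a new link drop happened during the transfer.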

Now my firewall logs: e1000: eth1: e1000_watchdog: NIC Link is Up
1000 Mbps Full Duplex, Flow Control: RX/TX
notifying me of the gigabit connection.

I totally forgot to mention that I also have a router PC in
front, which also has e1000 cards (the same ones) and runs Zebra and
only Zebra.
All this machine does is forward packets with Zebra.

I also updated the drivers from 7.3 to 8.0.16 on that machine today,
and turned TSO off on those cards (and also set TxDescriptorStep=4,4).
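
For reference, the TSO change on both machines can be written down as a small helper. This is a sketch under assumptions: the function name tso_off_commands and the interface names eth0 and eth1 are placeholders, and the helper only prints the ethtool commands so they can be reviewed before being run as root.

```shell
# tso_off_commands IFACE...
# Prints (does not run) one "ethtool -K <iface> tso off" command per interface.
tso_off_commands() {
    for iface in "$@"; do
        printf 'ethtool -K %s tso off\n' "$iface"
    done
}

# Review the output first; to apply it, pipe it to sh as root:
#   tso_off_commands eth0 eth1 | sh
# Afterwards "ethtool -k eth0" lists the current offload settings.
tso_off_commands eth0 eth1
```

Settings made with ethtool -K do not survive a reboot or a driver reload, so they have to be reapplied after either. The TxDescriptorStep value is a module parameter and is given at modprobe time, as earlier in this thread.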

So, my Firewall and my Router are both connected to a Gigabit
switch right now, and both report: 1000 Mbps Full Duplex, Flow Control: RX/TX.



>
>> One thing I don't understand is, if the interface goes down and comes
>> right back up after 1-2 sec or so, why don't the firewall (netfilter)
>> rules stay in place?
>> Why do I have to relaunch the interface config and the netfilter rules?

> I don't know. It's possible (though unlikely) that if one of the
> Gateway interfaces is DHCP served, it could come back up and be
> served a new IP address, which would break any existing connections.
> As I say, this is not likely. I am running a NAT firewall now (with an
> 82541PI on the 'private' side and an 82572EI on the 'public'
> interface) in an attempt to repro this very issue, and I see only a
> momentary interruption in continuous traffic streaming when I ifdown
> and then ifup either the public or private interface. Do you think
> it is possible that your firewall itself may be causing a problem?
> If your firewall is simply a set of iptables rules, we could try to
> run it (or something like it) here.

I don't think the firewall rules have anything to do with this
problem, as I have used these rules and this script (extending and
configuring it to my needs from time to time, of course) on 3 or so
PCs.
Yes, my firewall script is generated with fwbuilder.



>> In the end, I will definitely try to replace my HP ProCurve 2524 with
>> another one, but I don't think the ProCurve is the problem... or is it?
>> As of today, the port and the interface are both configured with
>> AutoNeg and Full Duplex.
> I agree, the switch is probably not the problem here.

I just had to check it. I replaced it with a Gigabit switch to see whether
the ProCurve just couldn't handle all the load on the 100 Mbit fiber.
10 - 11 MB/s is clearly all that the link can handle.
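
As a rough sanity check on that figure, assuming the fiber is a 100 Mbit/s link (taken from the "100MBit" mentioned above, not measured here), dividing the line rate by 8 bits per byte gives the raw ceiling in MB/s. The helper name mbit_to_mbyte is made up for illustration.

```shell
# mbit_to_mbyte RATE_MBIT
# Converts a line rate in Mbit/s to MB/s (8 bits per byte), one decimal place.
mbit_to_mbyte() {
    awk -v r="$1" 'BEGIN { printf "%.1f\n", r / 8 }'
}

mbit_to_mbyte 100     # prints 12.5
mbit_to_mbyte 1000    # prints 125.0
```

Ethernet, IP and TCP headers take a few percent off the 12.5 MB/s ceiling, so a download running at 10 - 11 MB/s is roughly what a saturated 100 Mbit/s link looks like.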



So, this is the status as of now:

1. Replaced my switch with a Gigabit switch.
2. Updated the e1000 drivers from 7.3 to 8.0.16 on my Router.
3. Both the Firewall and the Router machine have TSO turned off.
4. Both the Firewall and the Router report a 1000 Mbps Full Duplex,
   Flow Control: RX/TX connection to my switch.
   (So I would think it can handle my 100 Mbit fiber connection.)

Today I will try to load up my link with fast, large transfers.


Thanks a lot, Dave, for your time and solutions.

And sorry for my bad English.

Will post status today.






>> Before this, I had another firewall server that had the old 3COM 3C905
>> cards.
>> From time to time I also got Tx Unit hangs on those cards, but the
>> internet link, the netfilter rules and the network configuration never
>> crashed.
>> It just kept going and going regardless of those Tx Unit hangs.
>> That was one of my main reasons for upgrading to Intel e1000 cards: to
>> replace my old PC and 3COM cards, and to handle my 100 Mbit dark fiber
>> with ~200 devices connected in 3-4 networks doing NAT and other nice
>> things.
>> Correct me if I'm wrong, but isn't this Tx Unit hang problem on e1000
>> mostly related to AMD CPUs and/or AMD chipset platforms?
> Yes, you are right. There are a lot of things that can cause a "TX  
> Unit Hang", but I was hoping that this one was the one that we had  
> already tied to older AMD platforms.
>> Today I found an old PC that is only collecting dust in my company. It
>> has the legendary Intel 440BX chipset (same as in the Cisco PIX) and 2x
>> SMP Intel Celeron 533 CPUs.
>> Would it solve the problem to swap the rusty AMD/VIA combination
>> in my current firewall for a rock-solid 440BX with Intel CPUs?
> I would hope so.
>> I remember that I never had any problems at all with PIII and/or
>> Intel chipset motherboards in the past.
>> Thanks a lot.
>> In desperate need for help. :)
>> Best regards,
>> Svavar O
>> Reykjavik - Iceland
>




Re: [E1000-devel] TX Unit hang with Intel Pro/1000 (82541PI) (e1000) driver on Firewall/NAT machine. Help needed
