Snooping showed that the receiving end sent RST packets.
My TCP is quite rusty, but if I remember correctly, RST does not mean "close the connection". Does anyone have a clue why the sending side chooses to close the connection?

Furthermore, the receiving end doesn't even get FIN packets from the sending end. Again, if I remember correctly, one should send a FIN and wait for the FIN/ACK (or a timeout, I guess) before closing the connection.
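
If snooping alone doesn't explain it, I'm tempted to also try a DTrace one-liner on the receiving node to see which kernel code path is emitting the resets. Something along these lines might do, assuming the mib provider's tcpOutRsts probe is available on this build (I haven't checked):

        dtrace -n 'mib:::tcpOutRsts { @[stack()] = count(); }'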

I'm in the process of upgrading to 132 to see if it fixes this problem (although the changelog doesn't mention anything related to TCP problems).

Arnaud

On 05/02/10 01:32, Arnaud Brand wrote:
Hi Min,

The transfer failed after 9h43 / 880GB.
I'm going to restart it and snoop to a file in the hope I see something.
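
Roughly along these lines (interface name, peer address and capture file are just placeholders for what I plan to use):

        snoop -d e1000g1 -o /var/tmp/transfer.snoop host <peer-ip>
        snoop -i /var/tmp/transfer.snoop | egrep 'Rst|Fin'

The second command is to read the capture back afterwards and pick the reset/FIN segments out of snoop's summary lines.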

If you have some dtrace script to see what's going on, or some system setting to get logs of what might be going on, I would be more than happy to test it.

Thanks,
Arnaud

On 04/02/10 11:12, Arnaud Brand wrote:
Hi Min,

I had problems with build 130, and saw that the e1000g driver was updated in 131. That's why we updated. I haven't tested with previous builds, and my tests on 130 were not that extensive, so I cannot tell whether it's the same problem or not.

The kstat command returns
        Reset Count                     0
        Reset Count                     0
on both nodes.

A colleague of mine connected the two machines back to back (I'm at home, sick, not at work). I restarted the transfers that failed last night, and I'll keep you posted (it has already done 50GB).

I know this switch (HP 4208) doesn't support jumbo frames, so I haven't enabled jumbo on the nodes. If this transfer works, could this mean our switch is kind of broken or buggy?
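
If it helps with debugging, the MTU on the aggregation can be double-checked with something like "dladm show-linkprop -p mtu trk0" (property name from memory); it should still report the default 1500 on both nodes.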

Regarding the replacement with other builds, I can do it on the receiving end, but not that easily on the sending end (the one that "loses" the connection).

Thanks,
Arnaud

On 04/02/2010 07:07, Min Miles Xu wrote:
Hi Arnaud,

On which build did you start to notice the issue? Could you still ping through when you noticed the error? What's the output of "kstat -m e1000g | grep Reset"? Recv_Length_Errors indicates that received packets are undersized (< 64 bytes) or oversized (> 1522 bytes when jumbo frames aren't enabled). I expect to narrow down the issue by simplifying the network configuration. You mentioned the two machines are connected via a switch. Could you try to connect the two machines back-to-back? Furthermore, could you try to replace one machine with other builds/OSes?

Thanks,

Miles Xu

Arnaud Brand wrote:
Hi folks,

My situation is the following: two computers (A and B) running OpenSolaris b131 with Intel 82574L NICs, connected through an HP 4208 switch.
Both computers are on the same network.
I have transfers running from computer A to computer B, either through ssh or netcat.

As long as computer B is not too busy, the transfer works like a charm.
But when B is really busy (doing zfs recv from a local file in this case), the transfer fails in an odd way after some time (tests show somewhere between 10 minutes and 13 hours).
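
For reference, the transfers are roughly of this shape (pool names, paths and the port number are placeholders, not the exact ones I use):

        # on B (receiving end): dump the incoming stream to a local file
        nc -l 5000 > /tank/dump/stream.zfs
        # on A (sending end): push a zfs stream over netcat
        zfs send tank/fs@snap | nc B 5000
        # meanwhile on B: a zfs recv from an earlier dump keeps the box busy
        zfs recv tank/copy < /tank/dump/previous-stream.zfs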

What's odd is that A reports that it could not read from B and closes the connection (no sign of it in netstat), but B still thinks the connection is open. Further, running "kstat -p | grep e1000g | grep -i err" on A shows all zeroes except for the following:
e1000g:1:statistics:Recv_Length_Errors  14
link:0:e1000g1:ierrors  14
e1000g:1:mac:ierrors    14

More details on the test cases are available here:
http://opensolaris.org/jive/thread.jspa?threadID=122977&tstart=0

You can see that Brent Jones mentioned the following CR, but it is marked as a duplicate of something fixed in 131.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6905510

I did not do any twiddling in e1000g.conf.
Both e1000g interfaces are grouped in an aggregation named trk0.
Per Richard Elling's advice, I disabled LACP and, just to be sure, unplugged one network cable on each machine.
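
For reference, the aggregation is just the two e1000g ports grouped into trk0; disabling LACP was done with something along these lines (from memory):

        dladm modify-aggr -L off trk0
        dladm show-aggr trk0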

If any of you has any clue or workaround to try, please share.

Thanks,
Arnaud
