Re: [networking-discuss] Help needed on big transfers failure with e1000g

Arnaud Brand Mon, 08 Feb 2010 15:35:11 -0800

Thanks for your reply.

On 08/02/10 23:18, James Carlson wrote:

Arnaud Brand wrote:

Snooping showed that the receiving end sent RST packets.
My TCP is quite rusty, but if I remember correctly, RST does not mean "close
the connection".

Correct.  It means "I don't know what you're talking about, so please go
away."

Yes, I reread the TCP docs and realized I was wrong. My bad.

Does anyone have a clue why the sending side chooses to close the
connection?

Causes for RST include:


   - peer application is intentionally setting the linger time to zero
     and issuing close(2), which results in TCP RST generation.

Might be possible, but I can't see why the receiving end would do that.

   - peer has crashed and rebooted, and thus no longer has that
     connection open.

Not the case (I had ssh connections open to both nodes and those kept running).

   - stateful middlebox (such as firewall or load-balancer) has lost
     state for the connection and is terminating it.

I followed some advice on this list and connected both servers directly, without a switch or firewall in between.

IPFilter is disabled on both servers.

   - network misconfiguration, such as duplicate IP addresses.

Not the case: both IP addresses are distinct (rechecked).

   - bugs in one or both peers (often related to TCP keepalive; key
     signature of such a problem is an apparent two-hour time limit).

That could be it, but I doubt it, since the disconnections occurred at random times, anywhere between 10 minutes and 13 hours. It should be noted that the node sending the RST keeps the connection open (netstat -a shows it's still established).

To be honest, that puzzles me.

You (at least) have to analyze the packet sequences to determine what is
going wrong.  Depending on the nature of the problem, it may also take
in-depth kernel debugging on one or both peers to locate the cause.
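
When reading the captured packets, the flag bits are what separate an RST from a FIN. As a minimal sketch (not a replacement for snoop or tcpdump), the Python below decodes the flag byte at offset 13 of a raw TCP header, as laid out in RFC 793. The names tcp_flags and make_header are hypothetical, and make_header only builds a synthetic 20-byte header for the demonstration.

```python
import struct
from typing import List

# Flag bits in byte 13 of the TCP header (RFC 793).
TCP_FLAGS = [
    (0x01, "FIN"), (0x02, "SYN"), (0x04, "RST"),
    (0x08, "PSH"), (0x10, "ACK"), (0x20, "URG"),
]


def tcp_flags(tcp_header: bytes) -> List[str]:
    """Return the names of the flags set in a raw TCP header."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    flag_byte = tcp_header[13]
    return [name for bit, name in TCP_FLAGS if flag_byte & bit]


def make_header(src_port: int, dst_port: int, seq: int, ack: int,
                flags: int) -> bytes:
    """Hypothetical helper: build a minimal 20-byte header for the demo."""
    data_offset = 5 << 4  # header length of 5 32-bit words, no options
    # ports, seq, ack, offset, flags, window, checksum, urgent pointer
    return struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack,
                       data_offset, flags, 0, 0, 0)


if __name__ == "__main__":
    for flags in (0x04, 0x14, 0x11):
        print(hex(flags), tcp_flags(make_header(1234, 22, 0, 0, flags)))
```

In a real capture, the TCP header starts after the link-layer and IP headers, so those would have to be skipped first.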

I relaunched another transfer and I'm running tcpdump on both servers in the hope of finding something. In the meantime, I've received a beta BIOS from Tyan which provides support for IKVM over tagged VLANs. So far, the Intel chips (on which the IKVM/IPMI card is piggy-backed) are working better than before.

I can't tell whether it's related or not; I'm keeping my fingers crossed.

Regarding kernel debugging, I thought I would look for DTrace scripts, and found some, but nothing that seemed relevant to my case. As I am a complete beginner (read: copy-paste) in DTrace, I haven't yet figured out how to write one myself.

(For what it's worth, it's not an OpenSolaris-specific issue.  These
sorts of unexpected RSTs plague users and are hard to diagnose properly,
even on a good day.)

Sorry, I didn't mean to cast doubt on OpenSolaris.

Furthermore, the receiving end doesn't even get FIN packets from the
sending end.

That's correct behavior when RST is seen.  RST is an abortive disconnect
-- quit right now; don't try to finish.  FIN is an orderly disconnect --
finish sending what you have then close normally.

RST doesn't (and shouldn't) cause FIN.

Again, if I remember correctly, one should send FIN packets and wait for
FIN/ACK (or timeout I guess) before closing the connection.

Only for a normal connection termination.  If the connection is
_broken_, you should see RST instead.
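
The difference between the two kinds of disconnect can be seen on a loopback connection. The following is a minimal Python sketch, assuming a Unix-like system where struct linger is two C ints. It closes one end either normally or with the linger time set to zero (the first cause listed earlier), then reports what the other end sees. The function name observe_close is hypothetical.

```python
import socket
import struct


def observe_close(abortive: bool) -> str:
    """Close one end of a loopback connection; report what the peer sees."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    try:
        if abortive:
            # l_onoff=1, l_linger=0: close() aborts the connection with RST.
            # Assumption: struct linger is two ints (Linux, Solaris, BSD).
            client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                              struct.pack("ii", 1, 0))
        client.close()
        server.settimeout(5)
        try:
            data = server.recv(1)
        except ConnectionResetError:
            return "RST"  # abortive disconnect: ECONNRESET on the peer
        return "FIN" if data == b"" else "data"  # b"" means end-of-stream
    finally:
        server.close()
        listener.close()


if __name__ == "__main__":
    print("normal close  ->", observe_close(abortive=False))
    print("linger 0 close ->", observe_close(abortive=True))
```

On such a system, the normal close should be reported as an end-of-stream (FIN) and the zero-linger close as a connection reset (RST), with no FIN exchanged in the second case.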

Got it. I never thought I could have forgotten that.

Thanks again for your reply: other points of view often bring up other ideas about the possible cause of the problem.


Have a nice evening,
Arnaud
