On Thu, Jul 05, 2007 at 01:26:44PM +0100, Adrian wrote:
> Just to keep you up to date with goings on.
> 
> Over the past few days I have been running tcpdump / wireshark on the server 
> and client machines, going over the pcap files and getting my users to note 
> down when they have problems.
> 
> There seems to be a consistent pattern. Below is a sample of the dump files 
> around the time of connection problems:
> 
> Client Machine:
>  185156 2007-07-04 16:32:27.708162 192.168.10.6   192.168.10.93  IMAP  Response: 58 OK IDLE completed
>  185157 2007-07-04 16:32:27.737345 192.168.10.93  192.168.10.6   IMAP  Request: 59 check
>  185158 2007-07-04 16:32:27.738016 192.168.10.6   192.168.10.93  TCP   [TCP ACKed lost segment] imap2 > 1665 [RST] Seq=10795 Len=0

Hmm, what program is generating this output from the pcap files? They are a
bit difficult to interpret as you've shown here, because the source and
destination port numbers are not visible. That is, it's not clear if the RST
in line 3 is for the same IMAP session as the previous two lines. Each
separate TCP connection from the same client will have a unique source port
number. For really detailed output, try

   tcpdump -r foo.pcap -n -s0 -vX

Anyway, in line 3 the server (imap2) has sent a RST to the client. If this
is for the same TCP session as the previous two lines then this is very
strange. Even if the server process were to crash and die (even with kill
-9), I'd expect the kernel to close the socket properly with a FIN exchange.

Possibly it indicates a protocol violation - e.g. an ACK for a packet you've
not even received yet. This could indicate a bug in the TCP stack.

>  185159 2007-07-04 16:32:27.785269 192.168.10.93  192.168.10.6   TCP   1879 > imap2 [SYN] Seq=0 Len=0 MSS=1460
>  185160 2007-07-04 16:32:27.785929 192.168.10.6   192.168.10.93  TCP   imap2 > 1879 [RST, ACK] Seq=0 Ack=1 Win=0 Len=0

Looks like the client has tried to connect, and the server has rejected the
connection immediately with RST. In this case I *can* see the port numbers,
so I can see these packets belong to the same session.
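
To make the "same session" test concrete, here is a minimal sketch in Python. The Packet type and session_key function are names I've made up for illustration; the only outside fact used is that imap2 is port 143. A TCP connection is identified by both address/port pairs, so the two directions should map to the same key:

```python
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical record holding just the fields needed to identify a session.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def session_key(p: Packet) -> tuple:
    """Return a direction-independent key for the TCP connection.

    Both endpoints are sorted, so client->server and server->client
    packets of one connection produce the same key.
    """
    return tuple(sorted([(p.src_ip, p.src_port), (p.dst_ip, p.dst_port)]))


# The SYN and the RST,ACK from the client-side capture (imap2 = port 143).
syn = Packet("192.168.10.93", 1879, "192.168.10.6", 143)
rst_ack = Packet("192.168.10.6", 143, "192.168.10.93", 1879)

# The earlier RST went to client port 1665, which is a different connection
# from the one on port 1879.
earlier_rst = Packet("192.168.10.6", 143, "192.168.10.93", 1665)

print(session_key(syn) == session_key(rst_ack))      # same connection
print(session_key(syn) == session_key(earlier_rst))  # different connection
```

The first three lines of the client dump can't be grouped this way, because only the RST line shows a port number.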

If it's not to do with the intervening network, it could indicate a resource
limit reached in the server. But see below.

> Server:
...
>  129624 2007-07-04 16:32:28.034988 192.168.10.6   192.168.10.93  IMAP  Response: 58 OK IDLE completed
>  129625 2007-07-04 16:32:28.396856 192.168.10.6   192.168.10.93  IMAP  [TCP Retransmission] Response: 58 OK IDLE completed
>  129626 2007-07-04 16:32:29.116820 192.168.10.6   192.168.10.93  IMAP  [TCP Retransmission] Response: 58 OK IDLE completed
>  129627 2007-07-04 16:32:30.556778 192.168.10.6   192.168.10.93  IMAP  [TCP Retransmission] Response: 58 OK IDLE completed
>  129630 2007-07-04 16:32:33.436668 192.168.10.6   192.168.10.93  IMAP  [TCP Retransmission] Response: 58 OK IDLE completed
>  129633 2007-07-04 16:32:39.196422 192.168.10.6   192.168.10.93  IMAP  [TCP Retransmission] Response: 58 OK IDLE completed
>  129722 2007-07-04 16:32:50.716508 192.168.10.6   192.168.10.93  IMAP  [TCP Retransmission] Response: 58 OK IDLE completed

Aha. Notice here that what the client sees is not the same as what the
server sees. The client says it received a RST frame from the server. The
server claims never to have sent it! It sent a TCP frame, and got no
response from the client. So the kernel keeps resending it, until it
eventually gives up.

If the connection really were closed, each of these subsequent TCP frames
should also elicit a RST from the other side.

Furthermore, the client then sends SYNs to the server which are rejected
with RST, but these SYNs never arrive at the server either, according to
your tcpdump.

> It looks to me like there may be some network timing issues which are causing 
> the client machine to reset the connection.

It's definitely not network timing. TCP is robust against loss and
re-ordering of packets. As you can see, the sending party retransmits
repeatedly when the other side doesn't acknowledge.
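
You can see the retransmission behaviour in the timestamps of your server-side dump. A small Python sketch, using only the seconds values copied from the lines you posted, shows the gap between sends roughly doubling each time, which is the usual exponential backoff of a sender that is getting no ACKs:

```python
# Seconds past 16:32 for the original "58 OK IDLE completed" response and
# its six retransmissions, copied from the server-side capture.
sent_at = [28.034988, 28.396856, 29.116820, 30.556778,
           33.436668, 39.196422, 50.716508]

# Time between consecutive transmissions of the same segment.
gaps = [later - earlier for earlier, later in zip(sent_at, sent_at[1:])]

# Ratio of each gap to the previous one; a value near 2 means the
# retransmission timer doubled.
ratios = [g2 / g1 for g1, g2 in zip(gaps, gaps[1:])]

for gap in gaps:
    print(f"{gap:.2f} s")
for ratio in ratios:
    print(f"x{ratio:.2f}")
```

The gaps come out at about 0.36, 0.72, 1.44, 2.88, 5.76 and 11.52 seconds. That is a healthy TCP stack doing its job, not a timing fault.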

However, the fact that the server doesn't see the RST suggests either that
something stateful is sitting between the server and the client, or else
that this traffic is ending up somewhere else.

Now, this loss of traffic could be to do with something complex modifying
and intercepting traffic (such as a firewall). But there's another, much
simpler explanation: someone has plugged a second device onto the network
with the same IP address as the server, i.e. 192.168.10.6.

That's a simple explanation which I think would give exactly the symptoms
you describe. You can prove this easily enough from your existing packet
capture. Use tcpdump to read the client's pcap file, and add the '-e' flag.
This will show you the source and destination MAC address (ethernet card
address) of each frame. Or if you're using wireshark, just examine the
ethernet headers in the GUI.

If you see frames coming from 192.168.10.6 with MAC address
xx:xx:xx:xx:xx:xx before the problem occurs, and then at problem time you
see frames from 192.168.10.6 with MAC address yy:yy:yy:yy:yy:yy, then you
have it.
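
If you'd rather not eyeball it, the same check can be scripted. This is only a rough sketch under several assumptions: the capture is in the classic libpcap file format (not pcapng), the link type is Ethernet, and the frames are untagged IPv4. The function names are my own. It collects every source MAC seen for each source IP and reports any IP with more than one:

```python
import struct
from collections import defaultdict


def macs_per_source_ip(pcap_bytes: bytes) -> dict:
    """Map each IPv4 source address to the set of Ethernet source MACs seen."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"          # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"          # file written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file")

    # The link type is the last field of the 24-byte global header.
    (linktype,) = struct.unpack(endian + "I", pcap_bytes[20:24])
    if linktype != 1:
        raise ValueError("expected Ethernet link type")

    seen = defaultdict(set)
    offset = 24
    while offset + 16 <= len(pcap_bytes):
        # Per-packet header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, caplen, _origlen = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16])
        frame = pcap_bytes[offset + 16:offset + 16 + caplen]
        offset += 16 + caplen

        # Skip anything that is not an untagged IPv4 frame (ethertype 0x0800).
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue
        src_mac = ":".join(f"{b:02x}" for b in frame[6:12])
        src_ip = ".".join(str(b) for b in frame[26:30])
        seen[src_ip].add(src_mac)
    return dict(seen)


def duplicated_ips(pcap_bytes: bytes) -> dict:
    """Return only the source IPs that appeared with more than one MAC."""
    return {ip: macs for ip, macs in macs_per_source_ip(pcap_bytes).items()
            if len(macs) > 1}
```

Read the client's capture with open(path, "rb").read() and pass the bytes to duplicated_ips(). If 192.168.10.6 shows up in the result, you have two machines answering for that address.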

So that's my theory of the day. If I'm right, then you just need to
locate the imposter's MAC address. Google for "mac address finder" to get a
tool which will tell you the manufacturer. Or if you have managed switches,
telnet into each one and look at the forwarding tables to locate which
physical port it's connected to.
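
For the manufacturer lookup, what those tools key on is the first three octets of the MAC address (the OUI). A small sketch for pulling that prefix out of whatever notation your capture tool prints; the function name and the sample address are made up for illustration:

```python
def oui(mac: str) -> str:
    """Return the first three octets of a MAC address as 'XX:XX:XX'.

    Accepts the common notations (colon, hyphen or dot separated) by
    keeping only the hex digits.
    """
    digits = "".join(c for c in mac if c in "0123456789abcdefABCDEF").upper()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in (0, 2, 4))


# Made-up example address, not taken from the capture.
print(oui("00-1a-2b-3c-4d-5e"))
```

Feed the resulting prefix to the lookup tool of your choice.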

> This might explain another reason 
> why I am not seeing a problem as most of the client machines have 100m/b 
> cards but my predecessor built the IT managers workstation and of course put 
> all the really good kit in it which means I have a Gigabit network card (or 
> so I discovered when I plugged the cable into the nice new gigabit network 
> switch ). I had assumed that the problem was related to the OS not the 
> hardware

The OS could be another candidate, because that's where the TCP stack lies,
and a bad TCP implementation could end up getting its knickers in a twist.
OTOH, if one side sends a RST then the other side should still receive it.

A related candidate is the network card itself, if it performs TCP checksum
offloading. From this point of view, a cheap but reliable 100Mbps card is
*better* than a fancy gigabit card with a buggy checksum offloading
implementation.

However this doesn't necessarily explain the problems seen above, unless the
card were faking a TCP RST and not sending it to the other side.

Another possibility would be any kind of redundancy mechanism you have in
place - e.g. two IMAP servers running concurrently on two different PCs,
with some mechanism for moving the IP address from one to the other. As soon
as the other machine takes over the first machine's IP address, you'll get
RSTs for existing sessions.

But that's what led me to the realisation that you could just have a second
machine plugged in with the same IP address.

> To test the timing theory I have put a 100m/b card into the server along with 
> the gigabit card, half the client machines are connecting through the 100m/b 
> and the other half through the gigabit. Once again watch this space.

I don't think there's any chance that it's to do with timing, trust me. But
this test would be a good way of trying out the TCP checksum offloading
issue, or if there's an imposter, of locating which half of the network it's
on.

You still want to make sure that there's nothing in between the client and
the server which could possibly be acting as a firewall, or otherwise
affecting traffic at layer 4 or above (e.g. a network load-balancer).

> Thank you for all your help and suggestions thus far, hopefully I'm getting 
> close to nailing this once and for all.

It's definitely an unusual problem!

Regards,

Brian.
