On Thu, Jul 05, 2007 at 01:26:44PM +0100, Adrian wrote:
> Just to keep you up to date with goings on.
>
> Over the past few days I have been running tcpdump / wireshark on the server
> and client machines, going over the pcap files and getting my users to note
> down when they have problems.
>
> There seems to be a consistent pattern. Below is a sample of the dump files
> around the time of connection problems:
>
> Client Machine:
> 185156 2007-07-04 16:32:27.708162 192.168.10.6 192.168.10.93
>     IMAP Response: 58 OK IDLE completed
> 185157 2007-07-04 16:32:27.737345 192.168.10.93 192.168.10.6
>     IMAP Request: 59 check
> 185158 2007-07-04 16:32:27.738016 192.168.10.6 192.168.10.93
>     TCP [TCP ACKed lost segment] imap2 > 1665 [RST] Seq=10795 Len=0

Hmm, what program is generating this output from the pcap files? They are a
bit difficult to interpret as you've shown here, because the source and
destination port numbers are not visible. That is, it's not clear if the RST
in line 3 is for the same IMAP session as the previous two lines. Each
separate TCP connection from the same client will have a unique source port
number.

For really detailed output, try

    tcpdump -r foo.pcap -n -s0 -vX

Anyway, in line 3 the server (imap2) has sent a RST to the client. If this is
for the same TCP session as the previous two lines then this is very strange.
Even if the server process were to crash and die (even with kill -9), I'd
expect the kernel to close the socket properly with a FIN exchange.

Possibly it indicates a protocol violation - e.g. an ACK for a packet you've
not even received yet. This could indicate a bug in the TCP stack.

> 185159 2007-07-04 16:32:27.785269 192.168.10.93 192.168.10.6
>     TCP 1879 > imap2 [SYN] Seq=0 Len=0 MSS=1460
> 185160 2007-07-04 16:32:27.785929 192.168.10.6 192.168.10.93
>     TCP imap2 > 1879 [RST, ACK] Seq=0 Ack=1 Win=0 Len=0

Looks like the client has tried to connect, and the server has rejected the
connection immediately with RST. In this case I *can* see the port numbers, so
I can see these packets belong to the same session. If it's not to do with the
intervening network, it could indicate a resource limit reached in the server.
But see below.

> Server:
...
> 129624 2007-07-04 16:32:28.034988 192.168.10.6 192.168.10.93
>     IMAP Response: 58 OK IDLE completed
> 129625 2007-07-04 16:32:28.396856 192.168.10.6 192.168.10.93
>     IMAP [TCP Retransmission] Response: 58 OK IDLE completed
> 129626 2007-07-04 16:32:29.116820 192.168.10.6 192.168.10.93
>     IMAP [TCP Retransmission] Response: 58 OK IDLE completed
> 129627 2007-07-04 16:32:30.556778 192.168.10.6 192.168.10.93
>     IMAP [TCP Retransmission] Response: 58 OK IDLE completed
> 129630 2007-07-04 16:32:33.436668 192.168.10.6 192.168.10.93
>     IMAP [TCP Retransmission] Response: 58 OK IDLE completed
> 129633 2007-07-04 16:32:39.196422 192.168.10.6 192.168.10.93
>     IMAP [TCP Retransmission] Response: 58 OK IDLE completed
> 129722 2007-07-04 16:32:50.716508 192.168.10.6 192.168.10.93
>     IMAP [TCP Retransmission] Response: 58 OK IDLE completed

Aha. Notice here that what the client sees is not the same as what the server
sees.

The client says it received a RST frame from the server. The server claims
never to have sent it! It sent a TCP frame, and got no response from the
client. So the kernel keeps resending it, until it eventually gives up. If the
connection really were closed, each of these subsequent TCP frames should also
elicit a RST from the other side.

Furthermore, the client then sends SYNs to the server which are rejected with
RST, but these SYNs never arrive at the server either, according to your
tcpdump.

> It looks to me like there may be some network timing issues which are causing
> the client machine to reset the connection.

It's definitely not network timing. TCP is robust against loss and re-ordering
of packets. As you can see, the sending party retransmits repeatedly when the
other side doesn't acknowledge.

However, the fact that the server doesn't see the RST suggests either that
something stateful is sitting between the server and the client, or else that
this traffic is ending up somewhere else.

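To put numbers on those retransmissions, here is a rough Python sketch (Python
is just my choice for illustration, and the names SENDS, gaps and ratios are
made up for this example). It takes the seven server-side timestamps for
"58 OK IDLE completed" from the capture quoted above and prints the gap between
successive sends. Each gap comes out at roughly double the previous one, which
is the ordinary exponential backoff you'd expect from a TCP stack that is
getting no ACKs, rather than a sign of a timing fault.

```python
from datetime import datetime

# Send times of "58 OK IDLE completed", copied from the server-side
# capture quoted above (original send first, then six retransmissions).
SENDS = [
    "16:32:28.034988",
    "16:32:28.396856",
    "16:32:29.116820",
    "16:32:30.556778",
    "16:32:33.436668",
    "16:32:39.196422",
    "16:32:50.716508",
]


def gaps(stamps):
    """Return the gaps, in seconds, between consecutive timestamps."""
    times = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def ratios(values):
    """Return each value divided by the one before it."""
    return [b / a for a, b in zip(values, values[1:])]


if __name__ == "__main__":
    g = gaps(SENDS)
    for gap in g:
        print(f"{gap:8.3f} s")
    print("ratios:", " ".join(f"{r:.2f}" for r in ratios(g)))
```

All of these times come from the server's own clock, so the comparison doesn't
depend on the client and server clocks agreeing with each other.
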
Now, this loss of traffic could be to do with something complex modifying and
intercepting traffic (such as a firewall). But there's another, much simpler
explanation: someone has plugged a second device onto the network with the
same IP address as the server, i.e. 192.168.10.6. That's a simple explanation
which I think would give exactly the symptoms you describe.

You can prove this easily enough from your existing packet capture. Use
tcpdump to read the client's pcap file, and add the '-e' flag. This will show
you the source and destination MAC address (ethernet card address) of each
frame. Or if you're using wireshark, just examine the ethernet headers in the
GUI.

If you see frames coming from 192.168.10.6 with MAC address xx:xx:xx:xx:xx:xx
before the problem occurs, and then at problem time you see frames from
192.168.10.6 with MAC address yy:yy:yy:yy:yy:yy, then you have it.

So that's my theory of the day. If I'm right, then you just need to locate the
imposter's MAC address. Google for "mac address finder" to get a tool which
will tell you the manufacturer. Or if you have managed switches, telnet into
each one and look at the forwarding tables to locate which physical port it's
connected to.

> This might explain another reason
> why I am not seeing a problem as most of the client machines have 100m/b
> cards but my predecessor built the IT managers workstation and of course put
> all the really good kit in it which means I have a Gigabit network card (or
> so I discovered when I plugged the cable into the nice new gigabit network
> switch ). I had assumed that the problem was related to the OS not the
> hardware

The OS could be another candidate, because that's where the TCP stack lies,
and a bad TCP implementation could end up getting its knickers in a twist.
OTOH, if one side sends a RST then the other side should still receive it.

A related candidate is the network card itself, if it performs TCP checksum
offloading.
From this point of view, a cheap but reliable 100Mbps card is *better* than a
fancy gigabit card with a buggy checksum offloading implementation. However,
this doesn't necessarily explain the problems seen above, unless the card were
faking a TCP RST and not sending it to the other side.

Another possibility would be any kind of redundancy mechanism you have in
place - e.g. two IMAP servers running concurrently on two different PCs, with
some mechanism for moving the IP address from one to the other. As soon as the
other machine takes over the first machine's IP address, you'll get RSTs for
existing sessions. But that's what led me to the realisation that you could
just have a second machine plugged in with the same IP address.

> To test the timing theory I have put a 100m/b card into the server along with
> the gigabit card, half the client machines are connecting through the 100m/b
> and the other half through the gigabit. Once again watch this space.

I don't think there's any chance that it's to do with timing, trust me. But
this test would be a good way of trying out the TCP checksum offloading issue,
or if there's an imposter, of locating which half of the network it's on.

You still want to make sure that there's nothing in between the client and the
server which could possibly be acting as a firewall, or otherwise affecting
traffic at layer 4 or above (e.g. network load-balancer).

> Thank you for all your help and suggestions thus far, hopefully I'm getting
> close to nailing this once and for all.

It's definitely an unusual problem!

Regards,

Brian.

_______________________________________________
Courier-imap mailing list
Unsubscribe: https://lists.sourceforge.net/lists/listinfo/courier-imap
