Hi,

I am running Linux 2.1.79 on my cluster. The machines are Dell PowerEdge 6100s: quad Pentium Pro, a gigabyte of memory, onboard Adaptec 2940, and Intel EtherExpress Pro NICs, communicating over an Intel 510T switch. I am running Message Passing Interface (MPI) programs on these machines, mostly jobs with two processes on one node and one on the other.

When I terminate a job prematurely with a kill or Ctrl-C, sockets get left in FIN_WAIT1. Sometimes the corresponding socket is in LAST_ACK, indicating that one end went into LAST_ACK without doing the needful to move the other end from FIN_WAIT1 to FIN_WAIT2, or something like that. The FIN_WAIT1 sockets never go away. From netstat:

  Proto Recv-Q Send-Q Local Address            Foreign Address          State
  tcp        0  23025 eip11.cluster01.en:4242  eip11.cluster01.en:4244  FIN_WAIT1
  tcp        0      1 eip11.cluster01.en:4234  eip11.cluster01.en:4232  FIN_WAIT1

This behaviour goes away when I use 2.1.79 uniprocessor; there, the two processes on the same machine run on the same processor.

Upgrading to 2.1.124 took away the FIN_WAIT1 problem, but performance is very much slower: jobs with 40,000 broadcasts and reduces, which take 8 seconds on 2.1.79, take 80 seconds on 2.1.124. I benchmarked the network performance for TCP and MPI, and the TCP performance of 2.1.124 is almost 10% lower. The results, using the NetPIPE benchmark, are at http://reno.cis.upenn.edu/~rahul/perf/, with graphs of throughput against block transfer time and block size, in both PostScript and GIF format.

Is there some way to fix the FIN_WAIT1 problem? Or, why is the throughput lower on 2.1.124, and is there some way to fix that instead? Does the IO-APIC have anything to do with it? (2.1.79 doesn't seem to have it.)

Thanks a lot,
Rahul
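P.S. In case it is relevant: one workaround I could try on our side, assuming the MPI layer lets me hook socket teardown (the function name below is just illustrative, not part of any MPI API), is an abortive close. Setting SO_LINGER with a zero timeout makes close() send an RST instead of a FIN, so the closing end never enters FIN_WAIT1 at all:

```python
import socket
import struct

def close_with_reset(sock):
    """Abortive close: skip the FIN handshake entirely.

    With l_onoff = 1 and l_linger = 0, close() discards any unsent data
    and sends an RST, so this end never sits in FIN_WAIT1 waiting for an
    ACK that a killed peer will never send.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack('ii', 1, 0))
    sock.close()
```

The peer then sees a connection reset rather than a clean EOF, which seems acceptable here since the job is being killed anyway.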
