We have a few dual PII 400 machines (P6DBE boards, eepro100 NICs) that
we'd like to use in a Beowulf-like cluster. However, all recent kernels
have exhibited the same problem: NPB2.3 MPI benchmarks keep getting stuck
waiting for incoming data. This happens only when there are 2 MPI
processes running on the same machine; there are no problems with one
process per machine.

This is the output of netstat -a -t on star3 for a job consisting of
8 MPI processes running on the star1, star2, star3, and star4 nodes.

star3> netstat -a -t
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address          Foreign Address        State
tcp        0      0 *:sunrpc               *:*                    LISTEN
tcp        0      0 *:ftp                  *:*                    LISTEN
tcp        0      0 *:telnet               *:*                    LISTEN
tcp        0      0 *:gopher               *:*                    LISTEN
tcp        0      0 *:shell                *:*                    LISTEN
tcp        0      0 *:login                *:*                    LISTEN
tcp        0      0 *:pop-2                *:*                    LISTEN
tcp        0      0 *:pop                  *:*                    LISTEN
tcp        0      0 *:imap                 *:*                    LISTEN
tcp        0      0 *:finger               *:*                    LISTEN
tcp        0      0 *:time                 *:*                    LISTEN
tcp        0      0 *:auth                 *:*                    LISTEN
tcp        0      0 *:857                  *:*                    LISTEN
tcp        0      0 *:smtp                 *:*                    LISTEN
tcp        0      0 *:10025                *:*                    LISTEN
tcp        0      3 star3.messier:login    starzero.messier:1013  ESTABLISHED
tcp        0      0 star3.messier:shell    star1.messier:1019     ESTABLISHED
tcp        0      0 star3.messier:1023     star1.messier:1018     ESTABLISHED
tcp        0      0 star3.messier:1147     star1.messier:1110     ESTABLISHED
tcp        0      0 *:1148                 *:*                    LISTEN
tcp        0      0 star3.messier:shell    star1.messier:1008     ESTABLISHED
tcp        0      0 star3.messier:1022     star1.messier:1005     ESTABLISHED
tcp        0      0 star3.messier:1149     star1.messier:1110     ESTABLISHED
tcp        0      0 *:1150                 *:*                    LISTEN
tcp        0      0 star3.messier:1152     star2.messier:1130     ESTABLISHED
tcp        0      0 star3.messier:1155     star3.messier:1154     ESTABLISHED
tcp        0      0 star3.messier:1154     star3.messier:1155     ESTABLISHED
tcp        0      0 star3.messier:1150     star1.messier:1121     TIME_WAIT
tcp        0      0 star3.messier:1156     star1.messier:1122     ESTABLISHED
tcp        0      0 star3.messier:1148     star4.messier:1134     TIME_WAIT
tcp        0      0 star3.messier:1150     star4.messier:1136     TIME_WAIT
tcp        0      0 star3.messier:1159     star4.messier:1138     ESTABLISHED
tcp        0   7796 star3.messier:1160     star4.messier:1139     ESTABLISHED
tcp        0      0 star3.messier:1148     star1.messier:1125     TIME_WAIT
tcp        0      0 star3.messier:1162     star1.messier:1127     ESTABLISHED

Recv-Q on star4 for "star3.messier:1160 star4.messier:1139" is empty.
It stays this way until the connection times out.

This is the tail of the tcpdump log for this connection. All traffic
stops after several spurious duplicate ACKs from star4, after which
star3 keeps retransmitting the same segment (1855465:1856913).

...
13:49:31.298117 star4.messier.1139 > star3.messier.1160: . 1848225:1849673(1448) ack 1851121 win 7240 <nop,nop,timestamp 56303 57114> (DF) [tos 0x18] (ttl 64, id 16102)
13:49:31.298173 star3.messier.1160 > star4.messier.1139: . ack 1849673 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50303)
13:49:31.298125 star4.messier.1139 > star3.messier.1160: . ack 1854017 win 5792 <nop,nop,timestamp 56303 57114> (DF) [tos 0x18] (ttl 64, id 16103)
13:49:31.298367 star4.messier.1139 > star3.messier.1160: . 1849673:1851121(1448) ack 1855465 win 4344 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16107)
13:49:31.299299 star4.messier.1139 > star3.messier.1160: . 1851121:1852569(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16112)
13:49:31.299516 star4.messier.1139 > star3.messier.1160: . 1852569:1854017(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16113)
13:49:31.299573 star3.messier.1160 > star4.messier.1139: . ack 1854017 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50314)
13:49:31.299784 star4.messier.1139 > star3.messier.1160: . 1854017:1855465(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16114)
13:49:31.300049 star3.messier.1160 > star4.messier.1139: . ack 1856913 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50316)
13:49:31.300276 star4.messier.1139 > star3.messier.1160: . 1856913:1858361(1448) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16117)
13:49:31.300588 star3.messier.1160 > star4.messier.1139: . ack 1859809 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50318)
13:49:31.301098 star3.messier.1160 > star4.messier.1139: . ack 1862705 win 14480 <nop,nop,timestamp 57115 56303> (DF) [tos 0x18] (ttl 64, id 50320)
13:49:31.301089 star4.messier.1139 > star3.messier.1160: P 1862705:1863193(488) ack 1855465 win 15928 <nop,nop,timestamp 56303 57115> (DF) [tos 0x18] (ttl 64, id 16124)
13:49:31.301839 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16130)
13:49:31.301949 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16131)
13:49:31.302227 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16133)
13:49:31.302517 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16134)
13:49:31.302522 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56303 57115,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16136)
13:49:31.309051 star3.messier.1160 > star4.messier.1139: P 1863193:1863225(32) ack 1863193 win 15928 <nop,nop,timestamp 57116 56303> (DF) [tos 0x18] (ttl 64, id 50382)
13:49:31.309072 star3.messier.1160 > star4.messier.1139: P 1863225:1863261(36) ack 1863193 win 15928 <nop,nop,timestamp 57116 56303> (DF) [tos 0x18] (ttl 64, id 50383)
13:49:31.309248 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56304 57116,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16149)
13:49:31.309306 star4.messier.1139 > star3.messier.1160: . ack 1855465 win 15928 <nop,nop,timestamp 56304 57116,nop,nop,[|tcp]> (DF) [tos 0x18] (ttl 64, id 16150)
13:49:31.493869 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57135 56304> (DF) [tos 0x18] (ttl 64, id 50388)
13:49:31.893864 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57175 56304> (DF) [tos 0x18] (ttl 64, id 50389)
13:49:32.693868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57255 56304> (DF) [tos 0x18] (ttl 64, id 50410)
13:49:34.293868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57415 56304> (DF) [tos 0x18] (ttl 64, id 50433)
13:49:37.493868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 57735 56304> (DF) [tos 0x18] (ttl 64, id 50489)
13:49:43.893869 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 58375 56304> (DF) [tos 0x18] (ttl 64, id 50610)
13:49:56.693871 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 59655 56304> (DF) [tos 0x18] (ttl 64, id 50689)
13:50:22.293868 star3.messier.1160 > star4.messier.1139: . 1855465:1856913(1448) ack 1863193 win 15928 <nop,nop,timestamp 62215 56304> (DF) [tos 0x18] (ttl 64, id 50878)
...etc, until timeout

The corresponding /proc/net/tcp records look like this:

sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
 0: 0501A8C0:048A 0301A8C0:0467 01 00000000:00000000 00:00000000 00000000 520       0 3122
 1: 0501A8C0:0488 0601A8C0:0473 01 00001E74:00000000 01:00000CD0 00000008 520       0 3115
 2: 0501A8C0:0487 0601A8C0:0472 01 00000000:00000000 00:00000000 00000000 520       0 3114
...

Also, I have one more report about the same problem with MPI on dual
PII 400 systems.
The hardware is slightly different (Gigabyte Ga-6BXDS boards, Tulip 21140 NICs) but the symptoms are the same.

Any suggestions? I can offer remote access to the cluster if someone familiar with the networking code wants to take a closer look at this.

Alex Korobka
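
As a reading aid for the /proc/net/tcp listing above, here is a minimal Python sketch that decodes one record. It assumes the address is a 32-bit value printed in little-endian host order, as on x86; the helper names decode_endpoint and decode_record are made up for illustration and are not part of any existing tool. Under that assumption, entry 1 decodes to ports 1160 -> 1139 with 7796 bytes in tx_queue, which lines up with the stuck Send-Q shown by netstat.

```python
import socket
import struct


def decode_endpoint(hex_endpoint):
    """Turn an 'ADDR:PORT' hex pair from /proc/net/tcp into (ip, port).

    Assumption: the address was printed on a little-endian host (x86),
    so the bytes are reversed relative to dotted-quad order.
    """
    addr_hex, port_hex = hex_endpoint.split(":")
    packed = struct.pack("<I", int(addr_hex, 16))
    return socket.inet_ntoa(packed), int(port_hex, 16)


def decode_record(line):
    """Decode the leading fields of one /proc/net/tcp record.

    Field order follows the header quoted in the post:
    sl local_address rem_address st tx_queue:rx_queue tr:tm->when retrnsmt ...
    """
    fields = line.split()
    tx_hex, rx_hex = fields[4].split(":")
    timer_hex, when_hex = fields[5].split(":")
    return {
        "local": decode_endpoint(fields[1]),
        "remote": decode_endpoint(fields[2]),
        "state": int(fields[3], 16),       # raw state number, not interpreted here
        "tx_queue": int(tx_hex, 16),       # bytes waiting in the send queue
        "rx_queue": int(rx_hex, 16),       # bytes waiting in the receive queue
        "timer": int(timer_hex, 16),       # raw 'tr' value, not interpreted here
        "timer_when": int(when_hex, 16),   # raw 'tm->when' value
        "retransmits": int(fields[6], 16), # raw 'retrnsmt' value
    }


if __name__ == "__main__":
    # Entry 1 from the listing in the post.
    record = ("1: 0501A8C0:0488 0601A8C0:0473 01 00001E74:00000000 "
              "01:00000CD0 00000008 520 0 3115")
    for key, value in decode_record(record).items():
        print(key, value)
```

The timer and state fields are left as raw numbers on purpose, since their meaning depends on the kernel version in use.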
