Roland,

Shrinking the TCP window does not seem to have helped. I captured the dmesg logs from the Lustre server and from the associated client that reported the IOZONE error. BTW, this problem is a moving target, so it is hard to believe that it is hardware related(?). I am using a Mellanox DDR switch and HCA.

Thanks,
Helen

------- Dmesg from Lustre server ------
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 4638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 5638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 6638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 7638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 8638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 9638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 10638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 11638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 12638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 13638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 14638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 15638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 16638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 17638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 18638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 19638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 20638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 21638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 22638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 23638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 24638
LustreError: 12471:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET [EMAIL PROTECTED] x20249/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 12485:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.73-12345
LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET [EMAIL PROTECTED] x20359/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) previously skipped 1 similar messages
LustreError: 12477:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.78-12345
LustreError: 12477:0:(filter.c:1728:filter_grant_sanity_check()) filter_disconnect: tot_granted 48570368 != fo_tot_granted 49618944
LustreError: 12477:0:(filter.c:1731:filter_grant_sanity_check()) filter_disconnect: tot_pending 7340032 != fo_tot_pending 8388608
Lustre: A connection with 192.168.2.80 timed out; the network or that node may be down.
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80250 ip 192.168.2.80:1022
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 25638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 26638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 27638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 28638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 29638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 30638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 31638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 32638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 33638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 34638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 35638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 36638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 37638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 38638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 39638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 40638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 41638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 42638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 43638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 44638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 45638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 46638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 47638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 48638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 49638
LustreError: A timeout occurred receiving data from 192.168.2.73; the network or that node may be down.
LustreError: 12189:0:(socknal_cb.c:2214:ksocknal_find_timed_out_conn()) Timed out RX from 0xc0a80249 f2630000 192.168.2.73
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80249 ip 192.168.2.73:1021
LustreError: 12189:0:(socknal.c:1329:ksocknal_destroy_conn()) Completing partial receive from 0xc0a8024e, ip 192.168.2.78:1021, with error
LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc eb0c8000
LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc f2603000
LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.78-12345
LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) previously skipped 6 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 50638
Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down.
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) previously skipped 1 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 51638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 52638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 53638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 54638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 55638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 56638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 57638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 58638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 59638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 60638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 61638
Lustre: A connection with 192.168.2.72 timed out; the network or that node may be down.
Lustre: previously skipped 3 similar messages
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80248 ip 192.168.2.72:1021
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) previously skipped 3 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 62638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 63638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 64638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 65638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 66638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 67638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 68638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 69638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 70638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 71638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 72638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 73638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 74638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 75638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 76638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 77638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 78638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 79638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 80638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 81638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 82638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 83638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 84638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 85638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 86638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 87638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 88638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 89638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 90638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 91638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 92638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 93638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 94638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 95638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 96638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 97638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 98638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 99638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 100638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 101638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 102638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 103638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 104638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 105638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 106638
LustreError: 12458:0:(ldlm_lib.c:506:target_handle_reconnect()) 709aa0a3-a6a1-4134-b2b4-805212eb9430 reconnecting
Lustre: 12470:0:(filter.c:2645:filter_set_info()) on3-ost1: received MDS connection (0xbc2765ac563141df)
Lustre: 12486:0:(filter.c:2082:filter_destroy_precreated()) on3-ost2: deleting orphan objects from 6 to 67
Lustre: 12583:0:(llog_cat.c:352:llog_cat_process_cb()) processing log 0x149423e:3575f5db at index 2 of catalog 0x149423a
Lustre: 12583:0:(filter_log.c:235:filter_recov_log_mds_ost_cb()) fetch generation log, send cookie
Lustre: 12583:0:(llog.c:287:llog_process()) recovery from log: 0x149423e:3575f5db stopped
LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) 8ebea_lov2_7a4510c13a reconnecting
LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) e24e8_lov1_13fb4ed690 reconnecting
LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages
LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages
LustreError: 12461:0:(ldlm_lib.c:506:target_handle_reconnect()) 97cda_lov2_81558eef0b reconnecting
LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) 03c5b_lov2_084e2d0661 reconnecting
LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 1 similar messages
LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) 8da95_lov1_79a1a2e0bd reconnecting
LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) previously skipped 4 similar messages
LustreError: 12454:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239844, 100s ago) [EMAIL PROTECTED] x5/t0 o401->@NET_0xc0a80253_UUID:15 lens 104/64 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 12454:0:(recov_thread.c:410:log_commit_thread()) commit f538e000:f7679e80 drop 1 cookies: rc -110

--------- Dmesg from Lustre client -----------------------
Lustre: A connection with 192.168.2.74 timed out; the network or that node may be down.
LustreError: 11145:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024a ip 192.168.2.74:988
LustreError: 11143:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988
LustreError: Host 192.168.2.74 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 11143:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 64 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73)
LustreError: 11143:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 [EMAIL PROTECTED] x20271/t0 o400->[EMAIL PROTECTED]:6 lens 64/64 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239884, 3s ago) [EMAIL PROTECTED] x20271/t0 o400->[EMAIL PROTECTED]:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service on3-ost2 via nid 192.168.2.74 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 11269:0:(import.c:142:ptlrpc_set_import_discon()) OSC_on2_on3-ost2_MNT_on2-ib_2: connection lost to [EMAIL PROTECTED]
LustreError: 11270:0:(lib-move.c:1510:lib_api_put()) Error sending PUT to 0xc0a8024a: 19
LustreError: 11141:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988
LustreError: Host 192.168.2.74 was unreachable; the network or that node may be down, or Lustre may be misconfigured.
LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet type 1 len 240 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73)
LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) previously skipped 1 similar messages
LustreError: 11141:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 [EMAIL PROTECTED] x20283/t0 o8->[EMAIL PROTECTED]:6 lens 240/144 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11141:0:(events.c:61:request_out_callback()) previously skipped 3 similar messages
LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239912, 3s ago) [EMAIL PROTECTED] x20283/t0 o8->[EMAIL PROTECTED]:6 lens 240/144 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 3 similar messages
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239819, 100s ago) [EMAIL PROTECTED] x20242/t0 o4->[EMAIL PROTECTED]:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 1 similar messages
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129239834, 100s ago) [EMAIL PROTECTED] x20256/t0 o400->[EMAIL PROTECTED]:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously skipped 8 similar messages
Lustre: Connection restored to service on3-ost1 using nid 192.168.2.74.
Lustre: 11270:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on2_on3-ost1_MNT_on2-ib: connection restored to [EMAIL PROTECTED]
LustreError: This client was evicted by on3-ost2; in progress operations using this service will fail.
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x20302/t0 o4->[EMAIL PROTECTED]:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x20303/t0 o4->[EMAIL PROTECTED]:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x20305/t0 o4->[EMAIL PROTECTED]:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x20306/t0 o4->[EMAIL PROTECTED]:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x20307/t0 o4->[EMAIL PROTECTED]:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1925dc0 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1779840 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 275 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c177d820 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 485 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c1792560 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 815 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c18dd440 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 1399 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c18e3600 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 2637 similar messages
Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74.
Lustre: 11530:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on2_on3-ost2_MNT_on2-ib_2: connection restored to [EMAIL PROTECTED]

>From hycsw Thu Oct 13 14:21:18 2005
From: hycsw (Helen Chen)
Message-Id: <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 ib0: failed to allocate receive buffer
Cc: [EMAIL PROTECTED], [email protected]
Status: R

Roland,

>From [EMAIL PROTECTED] Thu Oct 13 13:53:05 2005
>
> Helen> Roland, Thank you for your response. That fixed my initial
> Helen> buffer allocation failure. After we tuned the Lustre and
> Helen> reran same IOZONE tests again, we got the following
> Helen> problem. Was there an actual network interrupt? If so, the
> Helen> problem is not obvious now; the two nodes are pinging over
> Helen> IPoIB. Please advice.
>
>That's very odd. This message:
>
> Helen> NETDEV WATCHDOG: ib0: transmit timed out
> Helen> ib0: transmit timeout: latency 1846
>
>says that we are not seeing send completions from the HCA. However,
>are you saying that even when you are seeing this message, ping over
>IPoIB is working?
>

No, I didn't know there was any problem until IOZONE reported a read error from the Lustre client. BTW, the backend storage is iSCSI over 10 GbE using jumbo frames. This problem only appeared after our tuning effort: we increased the iSCSI payload to 1 MB and increased the TCP window from 256 KB to 512 KB. I will shrink my TCP window and see if the problem goes away.

Thanks,
Helen

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
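
One thing that stands out in the server dmesg is that every "transmit timeout: latency" value is exactly 1000 higher than the one before it, from 1638 up to 106638. The log does not state the unit, so the reading that this is one stall that never cleared (rather than many separate short ones) is only an assumption. A rough Python sketch for pulling the values out of a saved dmesg capture and checking the step size follows; the function names and the hard-coded `ib0` interface name are illustrative choices, not part of any tool mentioned in this thread.

```python
import re

# Matches the IPoIB message that follows each "NETDEV WATCHDOG" line.
# The interface name ib0 is taken from the log in this thread.
LATENCY_RE = re.compile(r"ib0: transmit timeout: latency (\d+)")


def watchdog_latencies(dmesg_text):
    """Return the reported latency values in the order they appear."""
    return [int(m.group(1)) for m in LATENCY_RE.finditer(dmesg_text)]


def latency_steps(latencies):
    """Return the difference between each consecutive pair of values."""
    return [b - a for a, b in zip(latencies, latencies[1:])]


# First three watchdog entries copied from the server log in this message.
sample = (
    "NETDEV WATCHDOG: ib0: transmit timed out\n"
    "ib0: transmit timeout: latency 1638\n"
    "NETDEV WATCHDOG: ib0: transmit timed out\n"
    "ib0: transmit timeout: latency 2638\n"
    "NETDEV WATCHDOG: ib0: transmit timed out\n"
    "ib0: transmit timeout: latency 3638\n"
)

values = watchdog_latencies(sample)
print(values)                 # [1638, 2638, 3638]
print(latency_steps(values))  # [1000, 1000]
```

Running the same two functions over the whole capture should show quickly whether the step ever changes or the counter ever resets, which would be the places to line up against the Lustre eviction messages.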
