Roland,

Thank you for your response. That fixed my initial buffer allocation failure. After we tuned Lustre and reran the same IOzone tests, we ran into the following problem. Was there an actual network interruption? If so, it is not apparent now; the two nodes can ping each other over IPoIB. Please advise.

Thanks,
Helen

---- Dmesg Report from Lustre server -----
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1846
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2846
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3846
Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down.
LustreError: 10501:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021
LustreError: 10793:0:(ldlm_lib.c:506:target_handle_reconnect()) 460e5_lov2_7d3910bb5c reconnecting

----- Dmesg from Lustre client (192.168.2.79) ------
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 4965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 5965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 6965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 7965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 8965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 9965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 10965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 11965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 12965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 13965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 14965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 15965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 16965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 17965
Lustre: 10035:0:(socknal_cb.c:1326:ksocknal_process_receive()) [f6256000] EOF from 0xc0a80253 ip 192.168.2.83:988
LustreError: 10169:0:(client.c:568:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 [EMAIL PROTECTED] x13853/t0 o400->[EMAIL PROTECTED]:6 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107
LustreError: Connection to service on5-ost2 via nid 192.168.2.76 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection lost to [EMAIL PROTECTED]
LustreError: This client was evicted by on5-ost2; in progress operations using this service will fail.
LustreError: 10413:0:(rw.c:1253:ll_readpage()) page c1538cc0 map f6193328 index 825344 flags 20001023 count 3 priv e91da940: lock match failed: rc -5
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x13862/t0 o3->[EMAIL PROTECTED]:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x13868/t0 o3->[EMAIL PROTECTED]:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 4 similar messages
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x13880/t0 o3->[EMAIL PROTECTED]:6 lens 328/280 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously skipped 11 similar messages
Lustre: A connection with 192.168.2.75 timed out; the network or that node may be down.
LustreError: 10041:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024b ip 192.168.2.75:988
Lustre: Connection restored to service on5-ost2 using nid 192.168.2.76.
Lustre: 10496:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on5-ost2_MNT_on8-ib_2: connection restored to [EMAIL PROTECTED]
LustreError: 10169:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1129234515, 101s ago) [EMAIL PROTECTED] x13850/t0 o400->[EMAIL PROTECTED]:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service on12-mds2 via nid 192.168.2.83 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) MDC_on8_on12-mds2_MNT_on8-ib_2: connection lost to [EMAIL PROTECTED]
Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74.
Lustre: 10170:0:(import.c:714:ptlrpc_import_recovery_state_machine()) OSC_on8_on3-ost2_MNT_on8-ib_2: connection restored to [EMAIL PROTECTED]

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
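
For anyone cross-checking these logs: the hex value after `conn->` or `EOF from` in the socknal lines appears to be the peer's IPv4 address packed into 32 bits, since each one matches the dotted address printed next to it. The sketch below decodes those values and pulls the `latency` figures out of saved dmesg text so the two nodes can be compared. The function names `decode_peer` and `watchdog_latencies` are made up for illustration, and the regular expression assumes the message wording shown above; the unit of the latency figure is not stated in the log, so the code only reports the raw numbers.

```python
import ipaddress
import re


def decode_peer(hex_id):
    """Turn a 32-bit hex peer ID such as '0xc0a8024f' into dotted-quad form.

    Assumption: the ID is an IPv4 address in network byte order, which is
    consistent with the 'conn->0x... ip a.b.c.d' pairs in the logs.
    """
    return str(ipaddress.IPv4Address(int(hex_id, 16)))


def watchdog_latencies(dmesg_text):
    """Return the latency values from 'ib0: transmit timeout' lines, in order."""
    pattern = r"ib0: transmit timeout: latency (\d+)"
    return [int(value) for value in re.findall(pattern, dmesg_text)]


if __name__ == "__main__":
    # A short excerpt copied from the server log above.
    sample = (
        "NETDEV WATCHDOG: ib0: transmit timed out\n"
        "ib0: transmit timeout: latency 1846\n"
        "NETDEV WATCHDOG: ib0: transmit timed out\n"
        "ib0: transmit timeout: latency 2846\n"
    )
    print(watchdog_latencies(sample))
    print(decode_peer("0xc0a8024f"))
```

Run against the full client log, the latency list climbs in steps of 1000 from 1965 to 17965, which suggests one continuous stall on that node rather than many separate ones.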
