Roland,

Thank you for your response.  That fixed my initial buffer
allocation failure.  After we tuned Lustre and reran the
same IOZONE tests, we got the following problem.
Was there an actual network interruption? If so, it is no longer
evident: the two nodes can ping each other over IPoIB.
Please advise.

Thanks,
Helen

----- Dmesg from Lustre server -----
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1846
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2846
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3846
Lustre: A connection with 192.168.2.79 timed out; the network or that node may 
be down.
LustreError: 10501:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout 
out conn->0xc0a8024f ip 192.168.2.79:1021
LustreError: 10793:0:(ldlm_lib.c:506:target_handle_reconnect()) 
460e5_lov2_7d3910bb5c reconnecting

----- Dmesg from Lustre client (192.168.2.79) ------
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 4965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 5965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 6965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 7965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 8965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 9965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 10965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 11965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 12965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 13965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 14965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 15965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 16965
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 17965
Lustre: 10035:0:(socknal_cb.c:1326:ksocknal_process_receive()) [f6256000] EOF 
from 0xc0a80253 ip 192.168.2.83:988
LustreError: 10169:0:(client.c:568:ptlrpc_check_status()) @@@ type == 
PTL_RPC_MSG_ERR, err == -107 [EMAIL PROTECTED] x13853/t0
o400->[EMAIL PROTECTED]:6 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107
LustreError: Connection to service on5-ost2 via nid 192.168.2.76 was lost; in 
progress operations using this service will wait for recovery to
complete.
Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) 
OSC_on8_on5-ost2_MNT_on8-ib_2: connection lost to [EMAIL PROTECTED]
LustreError: This client was evicted by on5-ost2; in progress operations using 
this service will fail.
LustreError: 10413:0:(rw.c:1253:ll_readpage()) page c1538cc0 map f6193328 index 
825344 flags 20001023 count 3 priv e91da940: lock match failed: rc -5
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x13862/t0 o3->[EMAIL PROTECTED]:6 lens 328/280
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x13868/t0 o3->[EMAIL PROTECTED]:6 lens 328/280
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously 
skipped 4 similar messages
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x13880/t0 o3->[EMAIL PROTECTED]:6 lens 328/280
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 10169:0:(client.c:502:ptlrpc_import_delay_req()) previously 
skipped 11 similar messages
Lustre: A connection with 192.168.2.75 timed out; the network or that node may 
be down.
LustreError: 10041:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout 
out conn->0xc0a8024b ip 192.168.2.75:988
Lustre: Connection restored to service on5-ost2 using nid 192.168.2.76.
Lustre: 10496:0:(import.c:714:ptlrpc_import_recovery_state_machine()) 
OSC_on8_on5-ost2_MNT_on8-ib_2: connection restored to
[EMAIL PROTECTED]
LustreError: 10169:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1129234515, 101s ago) [EMAIL PROTECTED] x13850/t0
o400->[EMAIL PROTECTED]:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service on12-mds2 via nid 192.168.2.83 was lost; in 
progress operations using this service will wait for recovery to
complete.
Lustre: 10169:0:(import.c:142:ptlrpc_set_import_discon()) 
MDC_on8_on12-mds2_MNT_on8-ib_2: connection lost to [EMAIL PROTECTED]
Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74.
Lustre: 10170:0:(import.c:714:ptlrpc_import_recovery_state_machine()) 
OSC_on8_on3-ost2_MNT_on8-ib_2: connection restored to
[EMAIL PROTECTED]

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
