Roland,

Shrinking the TCP window does not seem to have helped.  I captured the
dmesg logs from the Lustre server and from the associated client that
reported the IOZONE error.  This problem is a moving target, so it is hard
to believe that it is hardware related(?).  BTW, I am using a Mellanox DDR
switch and HCA.

Thanks,
Helen

------- Dmesg from Lustre server ------
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 1638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 2638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 3638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 4638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 5638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 6638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 7638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 8638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 9638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 10638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 11638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 12638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 13638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 14638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 15638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 16638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 17638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 18638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 19638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 20638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 21638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 22638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 23638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 24638
LustreError: 12471:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk 
GET [EMAIL PROTECTED] x20249/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl
Interpret:/0/0 rc 0/0
LustreError: 12485:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm 
error evicting [EMAIL PROTECTED] id
192.168.2.73-12345
LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk 
GET [EMAIL PROTECTED] x20359/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl
Interpret:/0/0 rc 0/0
LustreError: 12468:0:(ost_handler.c:735:ost_brw_write()) previously skipped 1 
similar messages
LustreError: 12477:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm 
error evicting [EMAIL PROTECTED] id
192.168.2.78-12345
LustreError: 12477:0:(filter.c:1728:filter_grant_sanity_check()) 
filter_disconnect: tot_granted 48570368 != fo_tot_granted 49618944
LustreError: 12477:0:(filter.c:1731:filter_grant_sanity_check()) 
filter_disconnect: tot_pending 7340032 != fo_tot_pending 8388608
Lustre: A connection with 192.168.2.80 timed out; the network or that node may 
be down.
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout 
out conn->0xc0a80250 ip 192.168.2.80:1022
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 25638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 26638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 27638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 28638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 29638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 30638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 31638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 32638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 33638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 34638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 35638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 36638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 37638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 38638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 39638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 40638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 41638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 42638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 43638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 44638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 45638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 46638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 47638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 48638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 49638
LustreError: A timeout occurred receiving data from 192.168.2.73; the network 
or that node may be down.
LustreError: 12189:0:(socknal_cb.c:2214:ksocknal_find_timed_out_conn()) Timed 
out RX from 0xc0a80249 f2630000 192.168.2.73
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout 
out conn->0xc0a80249 ip 192.168.2.73:1021
LustreError: 12189:0:(socknal.c:1329:ksocknal_destroy_conn()) Completing 
partial receive from 0xc0a8024e, ip 192.168.2.78:1021, with error
LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 
19, desc eb0c8000
LustreError: 12189:0:(events.c:320:server_bulk_callback()) event type 5, status 
19, desc f2603000
LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) on3-ost2: bulk IO comm 
error evicting [EMAIL PROTECTED] id
192.168.2.78-12345
LustreError: 12468:0:(ost_handler.c:822:ost_brw_write()) previously skipped 6 
similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 50638
Lustre: A connection with 192.168.2.79 timed out; the network or that node may 
be down.
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout 
out conn->0xc0a8024f ip 192.168.2.79:1021
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) 
previously skipped 1 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 51638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 52638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 53638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 54638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 55638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 56638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 57638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 58638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 59638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 60638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 61638
Lustre: A connection with 192.168.2.72 timed out; the network or that node may 
be down.
Lustre: previously skipped 3 similar messages
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout 
out conn->0xc0a80248 ip 192.168.2.72:1021
LustreError: 12189:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) 
previously skipped 3 similar messages
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 62638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 63638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 64638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 65638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 66638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 67638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 68638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 69638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 70638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 71638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 72638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 73638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 74638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 75638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 76638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 77638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 78638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 79638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 80638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 81638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 82638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 83638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 84638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 85638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 86638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 87638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 88638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 89638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 90638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 91638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 92638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 93638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 94638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 95638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 96638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 97638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 98638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 99638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 100638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 101638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 102638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 103638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 104638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 105638
NETDEV WATCHDOG: ib0: transmit timed out
ib0: transmit timeout: latency 106638
LustreError: 12458:0:(ldlm_lib.c:506:target_handle_reconnect()) 
709aa0a3-a6a1-4134-b2b4-805212eb9430 reconnecting
Lustre: 12470:0:(filter.c:2645:filter_set_info()) on3-ost1: received MDS 
connection (0xbc2765ac563141df)
Lustre: 12486:0:(filter.c:2082:filter_destroy_precreated()) on3-ost2: deleting 
orphan objects from 6 to 67
Lustre: 12583:0:(llog_cat.c:352:llog_cat_process_cb()) processing log 
0x149423e:3575f5db at index 2 of catalog 0x149423a
Lustre: 12583:0:(filter_log.c:235:filter_recov_log_mds_ost_cb()) fetch 
generation log, send cookie
Lustre: 12583:0:(llog.c:287:llog_process()) recovery from log: 
0x149423e:3575f5db stopped
LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) 
8ebea_lov2_7a4510c13a reconnecting
LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) 
e24e8_lov1_13fb4ed690 reconnecting
LustreError: 12488:0:(ldlm_lib.c:506:target_handle_reconnect()) previously 
skipped 1 similar messages
LustreError: 12456:0:(ldlm_lib.c:506:target_handle_reconnect()) previously 
skipped 1 similar messages
LustreError: 12461:0:(ldlm_lib.c:506:target_handle_reconnect()) 
97cda_lov2_81558eef0b reconnecting
LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) 
03c5b_lov2_084e2d0661 reconnecting
LustreError: 12462:0:(ldlm_lib.c:506:target_handle_reconnect()) previously 
skipped 1 similar messages
LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) 
8da95_lov1_79a1a2e0bd reconnecting
LustreError: 12467:0:(ldlm_lib.c:506:target_handle_reconnect()) previously 
skipped 4 similar messages
LustreError: 12454:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1129239844, 100s ago) [EMAIL PROTECTED] x5/t0
o401->@NET_0xc0a80253_UUID:15 lens 104/64 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 12454:0:(recov_thread.c:410:log_commit_thread()) commit 
f538e000:f7679e80 drop 1 cookies: rc -110


--------- Dmesg from Lustre client -----------------------
Lustre: A connection with 192.168.2.74 timed out; the network or that node may 
be down.
LustreError: 11145:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout 
out conn->0xc0a8024a ip 192.168.2.74:988
LustreError: 11143:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) 
Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988
LustreError: Host 192.168.2.74 was unreachable; the network or that node may be 
down, or Lustre may be misconfigured.
LustreError: 11143:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet 
type 1 len 64 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73)
LustreError: 11143:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 
[EMAIL PROTECTED] x20271/t0 o400->[EMAIL PROTECTED]:6 lens
64/64 ref 2 fl Rpc:N/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1129239884, 3s ago) [EMAIL PROTECTED] x20271/t0
o400->[EMAIL PROTECTED]:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: Connection to service on3-ost2 via nid 192.168.2.74 was lost; in 
progress operations using this service will wait for recovery to
complete.
Lustre: 11269:0:(import.c:142:ptlrpc_set_import_discon()) 
OSC_on2_on3-ost2_MNT_on2-ib_2: connection lost to [EMAIL PROTECTED]
LustreError: 11270:0:(lib-move.c:1510:lib_api_put()) Error sending PUT to 
0xc0a8024a: 19
LustreError: 11141:0:(socknal_lib-linux.c:813:ksocknal_lib_connect_sock()) 
Error -113 connecting 192.168.2.73/1022 -> 192.168.2.74/988
LustreError: Host 192.168.2.74 was unreachable; the network or that node may be 
down, or Lustre may be misconfigured.
LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) Deleting packet 
type 1 len 240 (0xc0a80249 192.168.2.73->0xc0a8024a 192.168.2.73)
LustreError: 11141:0:(socknal_cb.c:2103:ksocknal_autoconnect()) previously 
skipped 1 similar messages
LustreError: 11141:0:(events.c:61:request_out_callback()) @@@ type 8, status 19 
[EMAIL PROTECTED] x20283/t0 o8->[EMAIL PROTECTED]:6 lens
240/144 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11141:0:(events.c:61:request_out_callback()) previously skipped 3 
similar messages
LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1129239912, 3s ago) [EMAIL PROTECTED] x20283/t0
o8->[EMAIL PROTECTED]:6 lens 240/144 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 11270:0:(client.c:945:ptlrpc_expire_one_request()) previously 
skipped 3 similar messages
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1129239819, 100s ago) [EMAIL PROTECTED] x20242/t0
o4->[EMAIL PROTECTED]:6 lens 328/288 ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously 
skipped 1 similar messages
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1129239834, 100s ago) [EMAIL PROTECTED] x20256/t0
o400->[EMAIL PROTECTED]:6 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 11269:0:(client.c:945:ptlrpc_expire_one_request()) previously 
skipped 8 similar messages
Lustre: Connection restored to service on3-ost1 using nid 192.168.2.74.
Lustre: 11270:0:(import.c:714:ptlrpc_import_recovery_state_machine()) 
OSC_on2_on3-ost1_MNT_on2-ib: connection restored to
[EMAIL PROTECTED]
LustreError: This client was evicted by on3-ost2; in progress operations using 
this service will fail.
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x20302/t0 o4->[EMAIL PROTECTED]:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x20303/t0 o4->[EMAIL PROTECTED]:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x20305/t0 o4->[EMAIL PROTECTED]:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x20306/t0 o4->[EMAIL PROTECTED]:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11269:0:(client.c:502:ptlrpc_import_delay_req()) @@@ IMP_INVALID 
[EMAIL PROTECTED] x20307/t0 o4->[EMAIL PROTECTED]:6 lens 328/288
ref 2 fl Rpc:/0/0 rc 0/0
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page 
c1925dc0 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page 
c1779840 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 
275 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page 
c177d820 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 
485 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page 
c1792560 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 
815 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page 
c18dd440 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 
1399 similar messages
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page 
c18e3600 failed: -5
LustreError: 11530:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 
2637 similar messages
Lustre: Connection restored to service on3-ost2 using nid 192.168.2.74.
Lustre: 11530:0:(import.c:714:ptlrpc_import_recovery_state_machine()) 
OSC_on2_on3-ost2_MNT_on2-ib_2: connection restored to
[EMAIL PROTECTED]
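
For anyone triaging logs like the two above, here is a minimal sketch of a summarizer. It only assumes the message formats visible in these logs ("ibN: transmit timeout: latency N" and "A connection with <ip> timed out"); the function name and the returned keys are made up for illustration and are not part of any Lustre or OpenIB tool.

```python
import re
from collections import Counter

# Patterns taken from the message formats that appear in the logs above.
LATENCY_RE = re.compile(r"ib\d+: transmit timeout: latency (\d+)")
PEER_TIMEOUT_RE = re.compile(r"A connection with (\d+\.\d+\.\d+\.\d+) timed out")


def summarize_dmesg(text):
    """Summarize IPoIB watchdog messages and Lustre peer timeouts in a dmesg dump."""
    # Latency values in the order the watchdog reported them.
    latencies = [int(m.group(1)) for m in LATENCY_RE.finditer(text)]
    # Distinct differences between consecutive latency values.
    steps = sorted({b - a for a, b in zip(latencies, latencies[1:])})
    # Number of "connection timed out" messages per peer address.
    peers = Counter(PEER_TIMEOUT_RE.findall(text))
    return {
        "watchdog_count": len(latencies),
        "first_latency": latencies[0] if latencies else None,
        "last_latency": latencies[-1] if latencies else None,
        "latency_steps": steps,
        "peer_timeouts": dict(peers),
    }
```

Feeding it the saved server log (for example `summarize_dmesg(open("server-dmesg.txt").read())`, where the file name is hypothetical) shows at a glance whether the reported latency ever drops back between watchdog messages. In the server log above it appears to climb in steps of 1000 from 1638 to 106638 without resetting, which would be consistent with the transmit queue staying stalled for the whole period rather than recovering in between.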




>From hycsw Thu Oct 13 14:21:18 2005
From: hycsw (Helen Chen)
Message-Id: <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: [openib-general] Re: ib0: ipoib_ib_post_receive failed for buf 111 
ib0: failed to allocate receive buffer
Cc: [EMAIL PROTECTED], [email protected]
Status: R

Roland,

>From [EMAIL PROTECTED] Thu Oct 13 13:53:05 2005
>
>    Helen> Roland, Thank you for your response.  That fixed my initial
>    Helen> buffer allocation failure.  After we tuned the Lustre and
>    Helen> reran same IOZONE tests again, we got the following
>    Helen> problem.  Was there an actual network interrupt? If so, the
>    Helen> problem is not obvious now; the two nodes are pinging over
>    Helen> IPoIB.  Please advise.
>
>That's very odd.  This message:
>
>    Helen> NETDEV WATCHDOG: ib0: transmit timed out
>    Helen> ib0: transmit timeout: latency 1846
>
>says that we are not seeing send completions from the HCA.  However,
>are you saying that even when you are seeing this message, ping over
>IPoIB is working?
>

No, I didn't know there was any problem until IOZONE reported a read
error from the Lustre client.

BTW, the backend storage is iSCSI over 10 GbE using jumbo frames.  This
problem only appeared after our tuning effort: we increased the iSCSI
payload to 1 MB, and increased the TCP window from 256 KB to 512 KB.  I
will shrink my TCP window and see if the problem goes away.
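
The message does not say how the TCP window was changed. Assuming it was done through the Linux `net.ipv4.tcp_rmem` and `net.ipv4.tcp_wmem` sysctls (an assumption; it could equally be a per-socket buffer setting in the iSCSI initiator or target), a small sketch like the following could record the limits in effect before and after each tuning step, so that every test run can be tied to a known setting. The helper names are illustrative only.

```python
from pathlib import Path

# Standard Linux procfs locations for the TCP buffer auto-tuning limits.
# Whether these are the knobs that were tuned here is an assumption.
TCP_SYSCTLS = {
    "tcp_rmem": Path("/proc/sys/net/ipv4/tcp_rmem"),
    "tcp_wmem": Path("/proc/sys/net/ipv4/tcp_wmem"),
}


def parse_tcp_mem(raw):
    """Parse a 'min default max' sysctl triple (values in bytes) into a dict."""
    low, default, high = (int(field) for field in raw.split())
    return {"min": low, "default": default, "max": high}


def read_tcp_buffers():
    """Return the current limits for each sysctl file that exists on this host."""
    return {
        name: parse_tcp_mem(path.read_text())
        for name, path in TCP_SYSCTLS.items()
        if path.exists()
    }
```

Calling `read_tcp_buffers()` on both the Lustre server and the client before each IOZONE run would make it easier to confirm that the smaller window was really in effect when the timeouts recurred.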

Thanks,
Helen

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
