Hi,
I am running the stock IB stack distributed with the 2.6.12-5 kernel from gen2.
We installed Lustre 1.4.4 to run on top of IPoIB. When we ran
concurrent IOzone sessions from 8 clients to 4 servers, I got "ib0:
failed to allocate receive buffer" in dmesg, along with corresponding
IOzone read/write errors. If I don't terminate the IOzone sessions,
the ib0 interface eventually shuts down. Increasing
net.core.netdev_max_backlog from 300 to 3000 didn't solve the problem.
Is there another parameter to tweak?
BTW, I am attaching the entries from dmesg for your information.
Thanks, Helen
----- dmesg report -----
ib0: ipoib_ib_post_receive failed for buf 111
ib0: failed to allocate receive buffer
ib0: ipoib_ib_post_receive failed for buf 112
ib0: failed to allocate receive buffer
ib0: ipoib_ib_post_receive failed for buf 113
ib0: failed to allocate receive buffer
ib0: ipoib_ib_post_receive failed for buf 114
ib0: failed to allocate receive buffer
ib0: ipoib_ib_post_receive failed for buf 115
ib0: failed to allocate receive buffer
ib0: ipoib_ib_post_receive failed for buf 116
ib0: failed to allocate receive buffer
ib0: ipoib_ib_post_receive failed for buf 117
ib0: failed to allocate receive buffer
ib0: ipoib_ib_post_receive failed for buf 118
LustreError: 3838:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET [EMAIL PROTECTED] x1529751/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 3853:0:(ost_handler.c:735:ost_brw_write()) @@@ timeout on bulk GET [EMAIL PROTECTED] x1553961/t0 o4-><?>@<?>:-1 lens 328/288 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 3865:0:(ost_handler.c:822:ost_brw_write()) on6-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.78-12345
Lustre: A connection with 192.168.2.72 timed out; the network or that node may be down.
LustreError: 3541:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80248 ip 192.168.2.72:1021
Lustre: A connection with 192.168.2.79 timed out; the network or that node may be down.
LustreError: 3541:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024f ip 192.168.2.79:1021
Lustre: A connection with 192.168.2.73 timed out; the network or that node may be down.
Lustre: previously skipped 2 similar messages
LustreError: 3541:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80249 ip 192.168.2.73:1022
LustreError: 3541:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) previously skipped 2 similar messages
LustreError: 3541:0:(socknal.c:1329:ksocknal_destroy_conn()) Completing partial receive from 0xc0a80249, ip 192.168.2.73:1021, with error
LustreError: 3541:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc f37e1000
LustreError: 3838:0:(ost_handler.c:822:ost_brw_write()) on6-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.73-12345
LustreError: 3838:0:(ost_handler.c:822:ost_brw_write()) previously skipped 1 similar messages
LustreError: 3838:0:(filter.c:1728:filter_grant_sanity_check()) filter_disconnect: tot_granted 58273792 != fo_tot_granted 59322368
LustreError: 3838:0:(filter.c:1731:filter_grant_sanity_check()) filter_disconnect: tot_pending 0 != fo_tot_pending 1048576
LustreError: 3838:0:(filter.c:1728:filter_grant_sanity_check()) filter_destroy_export: tot_granted 38592512 != fo_tot_granted 39641088
LustreError: 3838:0:(filter.c:1731:filter_grant_sanity_check()) filter_destroy_export: tot_pending 0 != fo_tot_pending 1048576
Lustre: A connection with 192.168.2.83 timed out; the network or that node may be down.
LustreError: 3541:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80253 ip 192.168.2.83:1021
Lustre: A connection with 192.168.2.78 timed out; the network or that node may be down.
LustreError: 3541:0:(socknal_cb.c:2264:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a8024e ip 192.168.2.78:1021
LustreError: 3541:0:(socknal.c:1329:ksocknal_destroy_conn()) Completing partial receive from 0xc0a8024e, ip 192.168.2.78:1021, with error
LustreError: 3541:0:(events.c:320:server_bulk_callback()) event type 5, status 19, desc d4e17000
LustreError: 3853:0:(ost_handler.c:822:ost_brw_write()) on6-ost2: bulk IO comm error evicting [EMAIL PROTECTED] id 192.168.2.78-12345
LustreError: 3834:0:(client.c:945:ptlrpc_expire_one_request()) @@@ timeout (sent at 1128772410, 100s ago) [EMAIL PROTECTED] x195/t0 o401->@NET_0xc0a80253_UUID:15 lens 296/64 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 3834:0:(recov_thread.c:410:log_commit_thread()) commit ebb94000:e26d8380 drop 7 cookies: rc -110
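
In case it helps when reading the log: the hex value after "conn->" in the socknal timeout lines looks like the peer's IPv4 address written as a 32-bit hex number, since it matches the "ip" field printed right after it. The snippet below is only a rough sketch for decoding those values; conn_to_ip is a made-up helper name, not anything from Lustre, and the hex-is-IPv4 reading is an assumption based on this log alone.

```python
import ipaddress
import re

# Assumption: the 8 hex digits after "conn->" are the peer IPv4 address.
# conn_to_ip is a hypothetical helper, not part of Lustre or its tools.
CONN_RE = re.compile(r"conn->0x([0-9a-fA-F]{8})")


def conn_to_ip(line):
    """Return the dotted-quad form of the conn-> value, or None if absent."""
    match = CONN_RE.search(line)
    if match is None:
        return None
    return str(ipaddress.IPv4Address(int(match.group(1), 16)))


if __name__ == "__main__":
    sample = "Timeout out conn->0xc0a80248 ip 192.168.2.72:1021"
    print(conn_to_ip(sample))
```

Run against the sample line, this should print the same address as the "ip" field, which is how the client behind each timeout can be matched up with its eviction message.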