Brock Palen wrote:
> I am having servers LBUG on a regular basis, Clients are running
> 1.6.6 patchless on RHEL4, servers are running RHEL4 with 1.6.5.1
> RPM's from the download page. All connection is over Ethernet,
> Servers are x4600's.

This looks like bug 16496, which is fixed in 1.6.6. You should upgrade
your servers to 1.6.6.

cliffw

>
> The OSS that BUG'd has in its log:
>
> Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
> Jan 13 16:35:39 oss2 kernel: LustreError: 10243:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
> Jan 13 16:35:39 oss2 kernel: Lustre: 10243:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 10243
> Jan 13 16:35:39 oss2 kernel: ldlm_cn_08 R running task 0 10243 1 10244 7776 (L-TLB)
> Jan 13 16:35:39 oss2 kernel: 0000000000000000 ffffffffa0414629 00000103d83c7e00 0000000000000000
> Jan 13 16:35:39 oss2 kernel: 00000101f8c88d40 ffffffffa021445e 00000103e315dd98 0000000000000001
> Jan 13 16:35:39 oss2 kernel: 00000101f3993ea0 0000000000000000
> Jan 13 16:35:39 oss2 kernel: Call Trace:<ffffffffa0414629>{:ptlrpc:ptlrpc_server_handle_request+2457}
> Jan 13 16:35:39 oss2 kernel: <ffffffffa021445e>{:libcfs:lcw_update_time+30} <ffffffff80133855>{__wake_up_common+67}
> Jan 13 16:35:39 oss2 kernel: <ffffffffa0416d05>{:ptlrpc:ptlrpc_main+3989} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
> Jan 13 16:35:39 oss2 kernel: <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa0415270>{:ptlrpc:ptlrpc_retry_rqbds+0}
> Jan 13 16:35:39 oss2 kernel: <ffffffff80110de3>{child_rip+8} <ffffffffa0415d70>{:ptlrpc:ptlrpc_main+0}
> Jan 13 16:35:39 oss2 kernel: <ffffffff80110ddb>{child_rip+0}
> Jan 13 16:35:40 oss2 kernel: LustreError: dumping log to /tmp/lustre-log.1231882539.10243
>
> At the same time a client (nyx346) lost contact with that oss, and is
> never allowed to reconnect.
>
> Client /var/log/message:
>
> Jan 13 16:37:20 nyx346 kernel: Lustre: nobackup-OST000d-osc-000001022c2a7800: Connection to service nobackup-OST000d via nid 10.164.3....@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Jan 13 16:37:20 nyx346 kernel: Lustre: Skipped 6 previous similar messages
> Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
> Jan 13 16:37:20 nyx346 kernel: LustreError: 3889:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
> Jan 13 16:37:20 nyx346 kernel: LustreError: 11-0: an error occurred while communicating with 10.164.3....@tcp. The ost_connect operation failed with -16
> Jan 13 16:37:20 nyx346 kernel: LustreError: Skipped 10 previous similar messages
> Jan 13 16:37:45 nyx346 kernel: Lustre: 3849:0:(import.c:410:import_select_connection()) nobackup-OST000d-osc-000001022c2a7800: tried all connections, increasing latency to 7s
>
> Even now the server(OSS) is refusing connection to OST00d, with the
> message:
>
> Lustre: 9631:0:(ldlm_lib.c:760:target_handle_connect()) nobackup-OST000d: refuse reconnection from 145a1ec5-07ef-f7eb-0ca9-2a2b6503e...@10.164.1.90@tcp to 0x00000103d5ce7000; still busy with 2 active RPCs
>
> If I reboot the OSS, the OST's on it go though recovery like normal,
> and then the client is fine.
>
> Network looks clean, found one machine with lots of dropped packets
> between the servers, but that is not the client in question.
>
> Thank you! If it happens again, and I find any other data I will let
> you know.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
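
For anyone who wants to check their own syslog for the same symptoms before or after upgrading, here is a minimal sketch in Python. It is not a Lustre tool: the function name `scan_lustre_log`, the `Finding` type, and the three regular expressions are hypothetical and modeled only on the message wording quoted in this thread, so other Lustre versions may log these events differently.

```python
import re
from typing import Iterable, List, NamedTuple


class Finding(NamedTuple):
    kind: str    # "assertion", "lbug", or "refused"
    detail: str  # assertion expression, source location, or target summary


# Assumption: these patterns mirror only the log lines quoted in this
# thread; they are not an exhaustive or official list of Lustre messages.
ASSERTION_RE = re.compile(r"ASSERTION\((.+)\) failed")
LBUG_RE = re.compile(r"\(([\w.-]+:\s*\d+:\w+\(\))\) LBUG")
REFUSED_RE = re.compile(
    r"(\S+): refuse reconnection from .*still busy with (\d+) active RPCs"
)


def scan_lustre_log(lines: Iterable[str]) -> List[Finding]:
    """Return one Finding per line that matches a known symptom."""
    findings: List[Finding] = []
    for line in lines:
        match = ASSERTION_RE.search(line)
        if match:
            # The expression inside ASSERTION(...) that failed.
            findings.append(Finding("assertion", match.group(1)))
            continue
        match = LBUG_RE.search(line)
        if match:
            # The file:line:function() location that raised the LBUG.
            findings.append(Finding("lbug", match.group(1)))
            continue
        match = REFUSED_RE.search(line)
        if match:
            # The target name and how many RPCs it still reports as active.
            detail = f"{match.group(1)}: {match.group(2)} active RPCs"
            findings.append(Finding("refused", detail))
    return findings
```

To use it, pass any iterable of lines, for example an open file object for the server's syslog, and print the returned findings. An "lbug" finding on a server followed by "refused" findings for the same target would match the sequence reported above.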