Hi all,

We are seeing this too, with clients and servers running 2.6.18-92.1.26.el5_lustre.1.6.7.2smp, TCP over gig-e only, after an upgrade from 1.6.5.1 over the weekend. (It appears that older client versions are working fine, but a couple of the new ones have run without trouble too, so I don't really have enough stats to be sure that it's a version thing.)

If there's any chance it's related, we hit this bug on the MDS (also after an fsck) just before the upgrade:

https://bugzilla.lustre.org/show_bug.cgi?id=19091

It was preventing the MDS/MGS from starting after the fsck (but before the upgrade). Since bugzilla mentioned there was a related fix in 1.6.7.1, we proceeded with the upgrade, and the MDS started fine after that. There are still some odd messages in the MDS log though; see the bottom log segment below.

Any ideas out there?

Thanks,

Tim

-----

On the client (hand transcribed, please forgive any typos):

LustreError: 647:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-172.16.0....@tcp, match 115 length 1168 too big: 992 left, 992 allowed
Lustre: Request x115 sent from p1-MDT0000-mdc-ffff81012a031000 to NID 172.16.0....@tcp 100s has timed out (limit 100s)
Lustre: p1-MDT0000-mdc-ffff81012a031000: Connection to service prod_mds_001 via nid 172.16.0....@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: p1-MDT0000-mdc-ffff81012a031000: connection restored to service prod_mds_001 using nid 172.16.0....@tcp

and then repeat...
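
In case it helps anyone line these up in their own syslog, here is a rough Python sketch that pairs each "too big" reply with the request that later times out, on the assumption (suggested by the "match 115" / "Request x115" pair above) that the match number is the request xid. It is not part of any Lustre tooling; the function name `correlate` and both regular expressions are my own guesses based only on the lines quoted here, so they may need adjusting for other versions.

```python
import re
from collections import defaultdict

# Both patterns are assumptions derived from the client lines quoted above.
TOO_BIG = re.compile(
    r"lnet_try_match_md\(\)\).*?match (\d+) length (\d+) "
    r"too big: (\d+) left, (\d+) allowed"
)
TIMED_OUT = re.compile(r"Request x(\d+) sent from (\S+) .*?has timed out")


def correlate(lines):
    """Return {xid: {...}} for xids that show both a 'too big' reply and a timeout."""
    events = defaultdict(lambda: {"too_big": [], "timed_out": []})
    for line in lines:
        m = TOO_BIG.search(line)
        if m:
            xid = int(m.group(1))
            # record (reply length, bytes allowed) for this match number
            events[xid]["too_big"].append((int(m.group(2)), int(m.group(4))))
            continue
        m = TIMED_OUT.search(line)
        if m:
            # record which import (e.g. the mdc device) sent the request
            events[int(m.group(1))]["timed_out"].append(m.group(2))
    # keep only the xids where both symptoms were seen
    return {x: e for x, e in events.items() if e["too_big"] and e["timed_out"]}
```

Feeding it the lines of a client syslog (for example `correlate(open(path))`, with `path` being wherever your kernel messages end up) gives one entry per affected request, which makes it easy to see how much larger the reply was than the buffer the client had posted.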

On the servers:

Jun 22 19:01:29 mds001 kernel: LustreError: 3389:0:(service.c:611:ptlrpc_check_req()) @@@ DROPPING req from old connection 309 < 310 r...@ffff81010965dc00 x77181/t0 o400->12dffd61-75ec-a926-c333-3c3d8acf9...@net_0x20000ac100453_uuid:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Jun 22 19:01:29 mds001 kernel: LustreError: 3389:0:(service.c:611:ptlrpc_check_req()) Skipped 3 previous similar messages
Jun 22 19:02:06 mds001 kernel: Lustre: 3359:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: 23127a45-3e3a-5b92-dba5-c7444d593e7f reconnecting
Jun 22 19:02:06 mds001 kernel: Lustre: 3359:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 77 previous similar messages
Jun 22 19:02:25 oss019 kernel: Lustre: 3417:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0012: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss020 kernel: Lustre: 3370:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0013: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss014 kernel: Lustre: 3263:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST000d: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss025 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0018: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss024 kernel: Lustre: 3901:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0017: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss029 kernel: Lustre: 3879:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001c: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss028 kernel: Lustre: 3909:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001b: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss010 kernel: Lustre: 3462:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0009: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss021 kernel: Lustre: 3933:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0014: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss022 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0015: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss023 kernel: Lustre: 3928:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0016: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss030 kernel: Lustre: 3854:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001d: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss027 kernel: Lustre: 3907:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST001a: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss026 kernel: Lustre: 3914:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0019: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss018 kernel: Lustre: 3379:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0011: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss016 kernel: Lustre: 3268:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST000f: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss017 kernel: Lustre: 3402:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0010: 8f3b0b35-1636-5355-671e-96c33c4017fd reconnecting
Jun 22 19:02:25 oss010 kernel: Lustre: 3462:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss016 kernel: Lustre: 3268:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 19:02:25 oss018 kernel: Lustre: 3379:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss017 kernel: Lustre: 3402:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-OST0010: dcb418b0-12c5-61d2-ab8c-f9f3ced8130a reconnecting
Jun 22 19:02:25 oss030 kernel: Lustre: 3854:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss025 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 19:02:25 oss023 kernel: Lustre: 3928:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 2 previous similar messages
Jun 22 19:02:25 oss022 kernel: Lustre: 3904:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss021 kernel: Lustre: 3933:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss026 kernel: Lustre: 3914:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:02:25 oss027 kernel: Lustre: 3907:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 22 19:03:06 oss022 kernel: Lustre: p1-OST0015: haven't heard from client 6aaa9429-5a2c-9c20-1fe8-e42c3d108882 (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:06 oss011 kernel: Lustre: p1-OST000a: haven't heard from client 6aaa9429-5a2c-9c20-1fe8-e42c3d108882 (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:06 oss011 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:06 oss022 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss003 kernel: Lustre: p1-OST0002: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 mds001 kernel: Lustre: MGS: haven't heard from client d14860df-7906-9a56-5c84-79b25b9cc99e (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss007 kernel: Lustre: p1-OST0006: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 mds001 kernel: Lustre: Skipped 2 previous similar messages
Jun 22 19:03:07 oss006 kernel: Lustre: p1-OST0005: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss005 kernel: Lustre: p1-OST0004: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss021 kernel: Lustre: p1-OST0014: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss026 kernel: Lustre: p1-OST0019: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss024 kernel: Lustre: p1-OST0017: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss013 kernel: Lustre: p1-OST000c: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss029 kernel: Lustre: p1-OST001c: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss027 kernel: Lustre: p1-OST001a: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss030 kernel: Lustre: p1-OST001d: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss009 kernel: Lustre: p1-OST0008: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss012 kernel: Lustre: p1-OST000b: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss019 kernel: Lustre: p1-OST0012: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss020 kernel: Lustre: p1-OST0013: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss018 kernel: Lustre: p1-OST0011: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss028 kernel: Lustre: p1-OST001b: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss002 kernel: Lustre: p1-OST0001: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss025 kernel: Lustre: p1-OST0018: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss017 kernel: Lustre: p1-OST0010: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss001 kernel: Lustre: p1-OST0000: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss014 kernel: Lustre: p1-OST000d: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss004 kernel: Lustre: p1-OST0003: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss023 kernel: Lustre: p1-OST0016: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss016 kernel: Lustre: p1-OST000f: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss010 kernel: Lustre: p1-OST0009: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss008 kernel: Lustre: p1-OST0007: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss015 kernel: Lustre: p1-OST000e: haven't heard from client b7f3778d-1615-4e89-2829-5021086f51cf (at 172.16....@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jun 22 19:03:07 oss018 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss020 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss012 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss019 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss014 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss017 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss007 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss003 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss001 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss009 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss016 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss004 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss015 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss010 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss030 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss029 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss027 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss028 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss021 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss025 kernel: Lustre: Skipped 1 previous similar message
Jun 22 19:03:07 oss023 kernel: Lustre: Skipped 1 previous similar message

Possibly still related to the earlier problem, we have this sort of thing appearing in the server logs too:

Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(llog_obd.c:226:llog_add()) Skipped 351 previous similar messages
Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(lov_log.c:118:lov_llog_origin_add()) Can't add llog (rc = -19) for stripe 0
Jun 21 11:47:56 mds001 kernel: LustreError: 4040:0:(lov_log.c:118:lov_llog_origin_add()) Skipped 351 previous similar messages
Jun 21 11:48:04 mds001 kernel: Lustre: 4130:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:48:51 mds001 kernel: LustreError: 3624:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x45e0e2f sub-object on OST idx 15/1: rc = -110
Jun 21 11:49:44 mds001 kernel: Lustre: 4132:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:50:21 mds001 kernel: LustreError: 4151:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 11:50:21 mds001 kernel: LustreError: 4151:0:(lov_log.c:118:lov_llog_origin_add()) Can't add llog (rc = -19) for stripe 0
Jun 21 11:50:54 mds001 kernel: LustreError: 3631:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x51c0136 sub-object on OST idx 15/1: rc = -110
Jun 21 11:51:24 mds001 kernel: Lustre: 3644:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:51:50 mds001 kernel: LustreError: 4075:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 11:51:50 mds001 kernel: LustreError: 4075:0:(lov_log.c:118:lov_llog_origin_add()) Can't add llog (rc = -19) for stripe 0
Jun 21 11:53:05 mds001 kernel: Lustre: 4077:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:54:10 mds001 kernel: LustreError: 4128:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x45f118f sub-object on OST idx 15/1: rc = -110
Jun 21 11:54:10 mds001 kernel: LustreError: 4128:0:(lov_request.c:692:lov_update_create_set()) Skipped 1 previous similar message
Jun 21 11:54:45 mds001 kernel: Lustre: 4039:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:56:25 mds001 kernel: Lustre: 4147:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:58:05 mds001 kernel: Lustre: 4097:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 11:59:46 mds001 kernel: Lustre: 4075:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 12:03:06 mds001 kernel: Lustre: 4158:0:(ldlm_lib.c:541:target_handle_reconnect()) p1-MDT0000: b88f4d25-7ba1-eaf0-6ddb-e0b12b04a934 reconnecting
Jun 21 12:03:06 mds001 kernel: Lustre: 4158:0:(ldlm_lib.c:541:target_handle_reconnect()) Skipped 1 previous similar message
Jun 21 12:05:40 mds001 kernel: LustreError: 4057:0:(lov_request.c:692:lov_update_create_set()) error creating fid 0x45e0ff4 sub-object on OST idx 15/1: rc = -110
Jun 21 12:05:40 mds001 kernel: LustreError: 4057:0:(lov_request.c:692:lov_update_create_set()) Skipped 1 previous similar message
Jun 21 12:07:20 mds001 kernel: LustreError: 4071:0:(llog_obd.c:226:llog_add()) No ctxt
Jun 21 12:07:20 mds001 kernel: LustreError: 4071:0:(llog_obd.c:226:llog_add()) Skipped 8 previous similar messages

Cheers,

Tim

On Tue, Jun 9, 2009 at 3:55 AM, Michael D. Seymour<[email protected]> wrote:
> Alexey Lyashkov wrote:
>> Hi Michael,
>>
>>>> On Fri, 2009-05-22 at 16:38 -0400, Michael D. Seymour wrote:
>>>>> Hi all,
>>>>>
>>>>> One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a
>>>>> different network.
>>>>>
>>>>> We get the following messages on a particular client:
>>>>>
>>>>> May 22 15:07:45 trinity kernel: LustreError:
>>>>> 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from
>>>>> 12345-10.5.203....@tcp, match 19154486 length 728 too big: 704 left, 704
>>>>> allowed
>>>> what frequently for this bug?
>>> Sets of entries (about 20) happen a few times per day, each entry spaced about
>>> ten minutes apart.
>> can you please show syslog messages around this time - should be exist
>> lines with errors related to 'match XXXXX' (in this example match
>> 19154486 -- should be something about request x19154486).
>
> I've upgraded the MDS to 1.6.7.1. So far no issues. I will probably upgrade to
> 1.8 very soon. Will write back if there is still problems.
>
> Mike
>
>
> --
> Michael D. Seymour Phone: 416-978-8497
> Scientific Computing Support Fax: 416-978-3921
> Canadian Institute for Theoretical Astrophysics, University of Toronto
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
