Hi all,

I hope you can help us with some connection problems we are having with our Lustre file system. The filesystem, roc, consists of 6 OSSs with one OST per OSS. Each OSS runs the Lustre 1.6.7 RHEL 5 kernel on CentOS 5.1 (one unit uses CentOS 5.3). The MDS uses CentOS 5.1 and Lustre 1.6.7. 203 RHEL-based clients mount the filesystem, and all use Lustre 1.6.7. All are connected via a Gigabit Ethernet switch stack.
One client running CentOS 5.2 re-exports the Lustre filesystem via NFS on a different network.

We get the following messages on a particular client:

May 22 15:07:45 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big: 704 left, 704 allowed
May 22 15:07:45 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from roc-MDT0000-mdc-000001044e1d4c00 to NID 10.5.203....@tcp 300s ago has timed out (limit 300s).
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00: Connection to service roc-MDT0000 via nid 10.5.203....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 22 15:12:45 trinity kernel: Lustre: Skipped 3 previous similar messages
May 22 15:12:45 trinity kernel: Lustre: roc-MDT0000-mdc-000001044e1d4c00: Connection restored to service roc-MDT0000 using nid 10.5.203....@tcp.
May 22 15:12:45 trinity kernel: Lustre: Skipped 4 previous similar messages

[r...@trinity ~]# cat /proc/fs/lustre/lov/roc-clilov-000001044e1d4c00/uuid
84adb9a1-8959-fcf5-cc72-81c6a1e171b8

On the MDS containing roc-MDT0000:

May 22 15:12:45 rocpile kernel: Lustre: 19236:0:(ldlm_lib.c:538:target_handle_reconnect()) roc-MDT0000: 84adb9a1-8959-fcf5-cc72-81c6a1e171b8 reconnecting
May 22 15:12:45 rocpile kernel: Lustre: 19236:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 4 previous similar messages

Any idea what could be causing this? BUG 11332 looked similar, but it has been closed because of other related bugs being fixed.

Thanks,
Mike

--
Michael D. Seymour                     Phone: 416-978-8497
Scientific Computing Support           Fax: 416-978-3921
Canadian Institute for Theoretical Astrophysics, University of Toronto
_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
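
P.S. For anyone who wants to triage similar logs, here is a minimal Python sketch that pulls the "too big" messages and the request timeouts out of syslog lines and reports how far each packet exceeded the allowed size. The regular expressions are written only against the two message formats quoted above, and the names summarize and SAMPLE are hypothetical helpers for illustration, not part of any Lustre tool. Other Lustre versions may word these messages differently.

```python
import re

# Assumed format: the lnet_try_match_md() message exactly as quoted in the log above.
TOO_BIG = re.compile(
    r"match (?P<match>\d+) length (?P<length>\d+) too big: "
    r"(?P<left>\d+) left, (?P<allowed>\d+) allowed"
)
# Assumed format: the client-side request timeout message as quoted in the log above.
TIMED_OUT = re.compile(r"Request x(?P<xid>\d+) sent from (?P<dev>\S+) to NID")


def summarize(lines):
    """Return (oversized, timed_out).

    oversized maps match bits -> bytes by which the packet exceeded the limit.
    timed_out maps request xid -> the client device that sent the request.
    """
    oversized = {}
    timed_out = {}
    for line in lines:
        m = TOO_BIG.search(line)
        if m:
            oversized[m["match"]] = int(m["length"]) - int(m["allowed"])
            continue
        m = TIMED_OUT.search(line)
        if m:
            timed_out[m["xid"]] = m["dev"]
    return oversized, timed_out


# Two lines copied from the client log above (NIDs left truncated as in the original).
SAMPLE = [
    "May 22 15:07:45 trinity kernel: LustreError: 5111:0:(lib-move.c:110:lnet_try_match_md()) "
    "Matching packet from 12345-10.5.203....@tcp, match 19154486 length 728 too big: "
    "704 left, 704 allowed",
    "May 22 15:12:45 trinity kernel: Lustre: Request x19154486 sent from "
    "roc-MDT0000-mdc-000001044e1d4c00 to NID 10.5.203....@tcp 300s ago has timed out "
    "(limit 300s).",
]

if __name__ == "__main__":
    oversized, timed_out = summarize(SAMPLE)
    for match_bits, excess in oversized.items():
        note = "also timed out" if match_bits in timed_out else "no timeout seen"
        print(f"match {match_bits}: {excess} bytes over the limit ({note})")
```

On the two sample lines, the match value in the "too big" message (19154486) is the same number as the xid of the request that timed out 300 seconds later, and the packet was 728 - 704 = 24 bytes over the limit. That suggests, though it does not prove, that the oversized packet was the reply to that request and was dropped by the client.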
