Hello, I am new to Lustre. I've installed Lustre 1.6.4 on Scientific Linux 4.4 with kernel 2.6.9-55.0.9.EL_lustre.1.6.4smp.

The MGS, MDS and OST servers are all installed on the head node. The MGS and MDS servers have their storage on different disks.

MGS server on /dev/sdb1:

    /usr/sbin/mkfs.lustre --fsname=vtrak1fs --mgs /dev/sdb1

MDS server on /dev/sdc1:

    /usr/sbin/mkfs.lustre --fsname=vtrak1fs --mdt [EMAIL PROTECTED] /dev/sdc1

The OST storage is based on RAID5 and connected via SCSI directly to the head node.

OST1 server on /dev/sdg1:

    /usr/sbin/mkfs.lustre --fsname=vtrak1fs --ost [EMAIL PROTECTED] /dev/sdg1

On the client node, Lustre is started with mount:

    mount -t lustre [EMAIL PROTECTED]:/vtrak1fs /vtrak1

TCP networking is used for communication between the nodes. The file /etc/modprobe.conf contains the line:

    options lnet networks=tcp

The command /usr/sbin/lctl list_nids issued on the head node gives:

    [EMAIL PROTECTED]

As a test, I read all files from OST1 on the head node. All files were read successfully.

Then I started the same read test of all files from OST1 on the client node, whose address is 192.168.1.2. The command /usr/sbin/lctl list_nids issued on the client node gives:

    [EMAIL PROTECTED]

In this case the read test reads a number of files and then hangs on some file. The command dmesg issued on the client node gives these error messages:

    LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial receive from [EMAIL PROTECTED], ip 85.142.10.197:988, with error
    LustreError: 5017:0:(events.c:134:client_bulk_callback()) event type 1, status -5, desc ca9d3c00
    LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197447164, 150s ago) [EMAIL PROTECTED] x4566962/t0 o3->[EMAIL PROTECTED]@tcp:28 lens 384/336 ref 2 fl Rpc:/0/0 rc 0/-22
    LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
    Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection to service vtrak1fs-OST0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
    Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection restored to service vtrak1fs-OST0000 using nid [EMAIL PROTECTED]
    Lustre: Skipped 1 previous similar message
    hw tcp v4 csum failed
    hw tcp v4 csum failed
    ...

dmesg issued on the head node gives these errors:

    LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT [EMAIL PROTECTED] x4566962/t0 o3->[EMAIL PROTECTED]:-1 lens 384/336 ref 0 fl Interpret:/0/0 rc 0/0
    Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring bulk IO comm error with [EMAIL PROTECTED] id [EMAIL PROTECTED] - client will retry
    Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000: 629198c9-085d-f95a-462f-b5e535904a3d reconnecting

On the Lustre client, data checksums are disabled by default:

    cat /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages -> 0

What might be the reason(s)? Any hints? How can I trace the problem?

Thank you.

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
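
To compare the dmesg output of the two nodes more easily, the messages quoted above could be tallied by type. The following is only a rough sketch: summarize_lustre_log is a hypothetical helper name, not part of Lustre, and the three patterns it counts are simply the strings that appear in the logs in this post. It reads kernel messages on standard input and makes no attempt to interpret them.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): count the kinds of kernel
# messages seen in this post. Reads log text from standard input.
summarize_lustre_log() {
    awk '
        /LustreError:/ { errors++ }     # any LustreError line
        /csum failed/  { csum++ }       # e.g. "hw tcp v4 csum failed"
        /timeout/      { timeouts++ }   # RPC or bulk transfer timeouts
        END {
            # "+ 0" forces a numeric 0 when a pattern never matched
            printf "LustreError lines: %d\n", errors + 0
            printf "csum failed lines: %d\n", csum + 0
            printf "timeout lines: %d\n", timeouts + 0
        }
    '
}
```

On a live node this would presumably be run as `dmesg | summarize_lustre_log`, once on the client and once on the head node, so the counts can be compared side by side.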
