[Lustre-discuss] reading file hangs on Lustre 1.6.4 node

Anatoly Oreshkin Wed, 12 Dec 2007 07:53:14 -0800

Hello,

I am new to Lustre.
I've installed Lustre 1.6.4 on Scientific Linux 4.4 with
kernel 2.6.9-55.0.9.EL_lustre.1.6.4smp


The MGS, MDS and OST servers are all installed on the head node.
The MGS and MDS servers keep their storage on different disks.

MGS server on /dev/sdb1:
/usr/sbin/mkfs.lustre --fsname=vtrak1fs  --mgs /dev/sdb1

MDS server on /dev/sdc1:
/usr/sbin/mkfs.lustre --fsname=vtrak1fs --mdt [EMAIL PROTECTED] /dev/sdc1

The OST storage is a RAID5 array connected directly to the head node via SCSI.
OST1 server on /dev/sdg1:

/usr/sbin/mkfs.lustre --fsname=vtrak1fs --ost [EMAIL PROTECTED] /dev/sdg1


On the client node, Lustre is started with mount:

mount -t lustre [EMAIL PROTECTED]:/vtrak1fs /vtrak1

TCP networking is used for communication between the nodes.
The file /etc/modprobe.conf contains the line:

options lnet networks=tcp

The command /usr/sbin/lctl list_nids issued on the head node gives:

[EMAIL PROTECTED]

As a test, I read all the files on OST1 from the head node.
All files were read successfully.

Then I ran the same read test of all files on OST1 from the client node,
whose address is 192.168.1.2.

The command /usr/sbin/lctl list_nids issued on the client node gives:
[EMAIL PROTECTED]

In this case the read test reads a number of files and then hangs on one of them.
The dmesg command issued on the client node shows these error messages:

LustreError: 5017:0:(socklnd.c:1599:ksocknal_destroy_conn()) Completing partial
receive from [EMAIL PROTECTED], ip 85.142.10.197:988, with error
LustreError: 5017:0:(events.c:134:client_bulk_callback()) event type 1, status 
-5, desc ca9d3c00
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout 
(sent at 1197447164, 150s ago)  [EMAIL PROTECTED] x4566962/t0 
o3->[EMAIL PROTECTED]@tcp:28 lens 384/336 ref 2 fl Rpc:/0/0 
rc 0/-22
LustreError: 5019:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 8 
previous similar messages
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection to service vtrak1fs-OST0000 
via nid [EMAIL PROTECTED] was lost; in progress operations using this service 
will wait for recovery to complete.
Lustre: vtrak1fs-OST0000-osc-f7f53200: Connection restored to service 
vtrak1fs-OST0000 using nid [EMAIL PROTECTED]
Lustre: Skipped 1 previous similar message
hw tcp v4 csum failed
hw tcp v4 csum failed
...
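
To get a rough idea of how often each symptom shows up, the client log could be
tallied with a short script. This is only a sketch: the function name
summarize_dmesg and the three patterns are assumptions based on the messages
quoted above, not something Lustre itself provides, and real logs may contain
other variants.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Patterns copied from the messages quoted above. They are illustrative
# assumptions, not a complete list of Lustre or kernel log formats.
CSUM_RE = re.compile(r"hw tcp v4 csum failed")
# \s+ tolerates the line wrap between "timeout" and "(sent at ...".
TIMEOUT_RE = re.compile(r"timeout\s+\(sent at (\d+), (\d+)s ago\)")
# Matches "Connection to service ..." but not "Connection restored to service".
LOST_RE = re.compile(r"Connection to service\s+\S+")


def summarize_dmesg(text):
    """Count checksum failures, RPC timeouts and lost connections in dmesg text.

    Returns (counts, sent_times), where sent_times holds the "sent at"
    value of each timeout converted to a UTC datetime.
    """
    counts = Counter()
    counts["csum_failed"] = len(CSUM_RE.findall(text))
    counts["connection_lost"] = len(LOST_RE.findall(text))
    sent_times = []
    for sent, _age in TIMEOUT_RE.findall(text):
        counts["rpc_timeout"] += 1
        # "sent at" is seconds since the Unix epoch.
        sent_times.append(datetime.fromtimestamp(int(sent), tz=timezone.utc))
    return counts, sent_times
```

Fed the saved output of dmesg as a string, this returns the three counters plus
the send times of the expired requests. For the timeout quoted above,
1197447164 corresponds to 2007-12-12 08:12:44 UTC.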


The dmesg command issued on the head node shows these errors:

LustreError: 15048:0:(ost_handler.c:821:ost_brw_read()) @@@ timeout on bulk PUT
  [EMAIL PROTECTED] x4566962/t0 
o3->[EMAIL PROTECTED]:-1 lens 
384/336 ref 0 fl Interpret:/0/0 rc 0/0
Lustre: 15048:0:(ost_handler.c:881:ost_brw_read()) vtrak1fs-OST0000: ignoring 
bulk IO comm error with 
[EMAIL PROTECTED] id 
[EMAIL PROTECTED] - client will retry
Lustre: 14987:0:(ldlm_lib.c:519:target_handle_reconnect()) vtrak1fs-OST0000: 
629198c9-085d-f95a-462f-b5e535904a3d reconnecting

On the Lustre client, data checksums are disabled by default:

cat /proc/fs/lustre/llite/vtrak1fs-f7f53200/checksum_pages -> 0
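
The same check could be repeated for every mounted Lustre filesystem on a
client with a few lines of Python. This is a minimal sketch: the helper name
read_checksum_pages is hypothetical, and it assumes the
/proc/fs/lustre/llite/<instance>/checksum_pages layout shown in the path above.

```python
import glob
import os


def read_checksum_pages(root="/proc/fs/lustre/llite"):
    """Return {instance_name: value} for each checksum_pages file under root.

    The directory layout is assumed from the path quoted above; an empty
    dict is returned when nothing matches (e.g. no Lustre mount).
    """
    result = {}
    for path in sorted(glob.glob(os.path.join(root, "*", "checksum_pages"))):
        instance = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            result[instance] = handle.read().strip()
    return result
```

On the client described here, the expected result would be a single entry for
vtrak1fs-f7f53200 with the value "0".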

What might be the reason(s)?

Any hints? How can I trace the problem?

Thank you.



_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
