Hello,

we have a _test_ setup for a Lustre 1.6.5.1 installation with 2 RAID systems (64-bit systems) providing 4 OSTs of 6 TB each, plus one combined MDS and MDT server (32-bit system, for testing only).
OST lustre mkfs:

  mkfs.lustre --param="failover.mode=failout" --fsname scia --ost --mkfsoptions='-i 2097152 -E stride=16 -b 4096' [EMAIL PROTECTED] /dev/sdb

(Our files on the system are quite large, 100 MB+.)

Kernel: vanilla kernel 2.6.22.19, Lustre compiled from the sources on Gentoo 2008.0.

The client mount point is /misc/testfs via automount. It can be accessed through a link /mnt/testfs -> /misc/testfs.

The following procedure hangs a client:

1) copy files to the Lustre system
2) do a 'du -sh /mnt/testfs/willi' while copying
3) unmount an OST (here OST0003) while copying

The 'du' job hangs and the Lustre file system can no longer be accessed on this client, even from other logins. The only way to restore normal operation is IMHO a hard reset of the machine. A reboot hangs because the filesystem is still active. Other clients and their mount points are not affected as long as they do not access the file system with 'du', 'ls' or the like.

I know that this is drastic, but it may be triggered in production by our users.

Deactivating/reactivating or remounting the OST does not have any effect on the 'du' job. The 'du' job (#29665, see process list below) and the corresponding Lustre thread (#29694) cannot be killed manually.

This behaviour is reproducible. OST0003 is not reactivated on the client side, although the MDS does reactivate it. It seems that this information does not propagate to the client. See the last lines of dmesg below.

What is the proper way (besides avoiding the use of 'du') to reactivate the client file system?

Thanks and Regards
Heiko

The process list on the CLIENT:

<snip>
root     29175  5026  0 08:36 ?        00:00:00 sshd: laura [priv]
laura    29177 29175  0 08:36 ?        00:00:01 sshd: [EMAIL PROTECTED]/0
laura    29178 29177  0 08:36 pts/0    00:00:00 -bash
laura    29665 29178  0 09:15 pts/0    00:00:03 du -sh /mnt/testfs/foo/fam/
schell   29694     2  0 09:15 ?        00:00:00 [ll_sa_29665]
root     29695  4846  0 09:15 ?        00:00:00 /usr/sbin/automount --timeout 60 --pid-file /var/run/autofs.misc.pid /misc yp auto.misc
<snap>

and CLIENT dmesg:

Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 6s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ost_connect operation failed with -19
LustreError: Skipped 20 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ost_connect operation failed with -19
LustreError: Skipped 24 previous similar messages
Lustre: 5361:0:(import.c:395:import_select_connection()) scia-OST0003-osc-ffff8100ea24a000: tried all connections, increasing latency to 51s
Lustre: 5361:0:(import.c:395:import_select_connection()) Skipped 24 previous similar messages
LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.
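
A note on the -19 in the ost_connect failures above: assuming Lustre reports a negated Linux errno value here (the log itself does not say so), 19 would be ENODEV, which fits an OST that has been unmounted. A minimal Python sketch for decoding such values; decode_rc is a made-up helper name, not part of Lustre:

```python
import errno
import os


def decode_rc(rc):
    """Map a negative return code such as -19 to (symbolic name, message).

    Assumption: the value is a negated Linux errno.
    decode_rc is a hypothetical helper for illustration only.
    """
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Decode the value seen in the ost_connect failures.
print(decode_rc(-19))
```

The message text returned by os.strerror depends on the platform, so only the symbolic name should be relied on.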

The MDS dmesg:

<snip>
Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The ost_connect operation failed with -19
LustreError: Skipped 10 previous similar messages
Lustre: 6108:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 6108:0:(import.c:395:import_select_connection()) Skipped 20 previous similar messages
Lustre: Permanently deactivating scia-OST0003
Lustre: Setting parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 3 previous similar messages
Lustre: setting import scia-OST0003_UUID INACTIVE by administrator request
Lustre: scia-OST0003-osc.osc: set parameter active=0
Lustre: Skipped 2 previous similar messages
Lustre: scia-MDT0000: haven't heard from client 9111f740-b7a7-e2ff-b672-288a66decfab (at [EMAIL PROTECTED]) in 1269 seconds. I think it's dead, and I am evicting it.
Lustre: Permanently reactivating scia-OST0003
Lustre: Modifying parameter scia-OST0003-osc.osc.active in log scia-client
Lustre: Skipped 1 previous similar message
Lustre: 15406:0:(import.c:395:import_select_connection()) scia-OST0003-osc: tried all connections, increasing latency to 51s
Lustre: 15406:0:(import.c:395:import_select_connection()) Skipped 2 previous similar messages
LustreError: 167-0: This client was evicted by scia-OST0003; in progress operations using this service will fail.
Lustre: scia-OST0003-osc: Connection restored to service scia-OST0003 using nid [EMAIL PROTECTED]
Lustre: scia-OST0003-osc.osc: set parameter active=1
Lustre: MDS scia-MDT0000: scia-OST0003_UUID now active, resetting orphans
<snap>

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
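
To make the difference between the two dmesg excerpts above easier to see, here is a small Python sketch that reports the last import event logged for an OST. The patterns are copied from the wording of the log lines quoted in this mail and nothing more; last_import_event is a hypothetical helper, and lines worded differently will simply not match.

```python
import re

# Patterns taken from the log excerpts quoted in this mail (assumption:
# other Lustre versions may word these messages differently).
EVICTED = re.compile(r"This client was evicted by (\S+?);")
RESTORED = re.compile(r"Connection restored to service (\S+) using nid")


def last_import_event(log_lines, ost):
    """Return 'evicted', 'restored', or None for the given OST name,
    based on the last matching line in log_lines."""
    state = None
    for line in log_lines:
        m = EVICTED.search(line)
        if m and m.group(1) == ost:
            state = "evicted"
            continue
        m = RESTORED.search(line)
        if m and m.group(1) == ost:
            state = "restored"
    return state


# Last relevant lines of the client dmesg: no "Connection restored" follows.
client_log = [
    "LustreError: 11-0: an error occurred while communicating with "
    "[EMAIL PROTECTED] The ost_connect operation failed with -19",
    "LustreError: 167-0: This client was evicted by scia-OST0003; "
    "in progress operations using this service will fail.",
]

# Last relevant lines of the MDS dmesg: the eviction is followed by a restore.
mds_log = [
    "LustreError: 167-0: This client was evicted by scia-OST0003; "
    "in progress operations using this service will fail.",
    "Lustre: scia-OST0003-osc: Connection restored to service "
    "scia-OST0003 using nid [EMAIL PROTECTED]",
]

print("client:", last_import_event(client_log, "scia-OST0003"))
print("mds:   ", last_import_event(mds_log, "scia-OST0003"))
```

On the excerpts shown, the client ends in the evicted state while the MDS ends in the restored state, which is the mismatch described above.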
