We're seeing errors which we believe are down to automount returning too early 
from a Lustre mount.

We're using autofs, so the Lustre filesystem may be mounted immediately before 
the command that uses it runs.  We believe the client may not yet have 
established connections to all the OSTs when mount returns and the following 
command is run.

We've tried creating an automounter module based on mount_generic that simply 
puts a 1s delay in the mount, and that's reduced the number of errors, but 
they're very much still there.  Putting in a larger delay is an option, but 
fairly obviously a pretty bad one.
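
For comparison with a fixed delay in the mount module, here is a minimal sketch 
of an application-side alternative: retry fopen() a bounded number of times, 
but only when it fails with EIO.  The helper name fopen_retry_eio and its 
parameters are hypothetical (they are not part of Lustre or autofs), and it 
assumes the early failure always surfaces as EIO from fopen(), as in our logs.  
It only masks the symptom; it does not make mount wait for the OSTs.

```c
#define _POSIX_C_SOURCE 199309L	/* for nanosleep() */
#include <errno.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not part of Lustre or autofs): call fopen() up to
 * max_tries times, sleeping delay_ms milliseconds between attempts, but
 * only when the failure is EIO.  Any other error is returned to the caller
 * immediately with errno left as fopen() set it.  max_tries should be >= 1.
 */
static FILE *fopen_retry_eio(const char *path, const char *mode,
			     unsigned int max_tries, long delay_ms)
{
	struct timespec delay;
	FILE *f;
	unsigned int i;

	delay.tv_sec = delay_ms / 1000;
	delay.tv_nsec = (delay_ms % 1000) * 1000000L;

	for (i = 0; i < max_tries; i++) {
		f = fopen(path, mode);
		if (f != NULL || errno != EIO)
			return f;	/* success, or an error we do not retry */
		if (i + 1 < max_tries)
			nanosleep(&delay, NULL);
	}

	errno = EIO;	/* nanosleep() may have changed errno */
	return NULL;
}
```

In a test program this would stand in for a plain fopen() call, for example 
f = fopen_retry_eio(argv[1], "a", 5, 200); the values 5 and 200 are arbitrary 
and would need tuning.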

Once the filesystem is actually mounted, things work properly until, of 
course, the automounter drops the mount again.

Pasted below are two example log excerpts where we've automounted a filesystem 
called /net/epsilon, then immediately tried to fopen() a file on it, which 
fails with an I/O error.

I've attached a tiny C program that can regularly replicate the issue (it 
happened on 16 machines when run with pdsh across a set of roughly 400, and 
this is fairly representative).

Any ideas or recommendations would be much appreciated,

Stephen



Mar 25 12:26:38 rr445 automount[6457]: open_mount: (mount):cannot open mount module lustre (/usr/lib64/autofs/mount_lustre.so: cannot open shared object file: No such file or directory)
Mar 25 12:26:38 rr445 kernel: Lustre: Client epsilon-client has started
Mar 25 12:26:38 rr445 kernel: LustreError: 22600:0:(file.c:993:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO


Mar 25 12:26:37 rr447 automount[6458]: open_mount: (mount):cannot open mount module lustre (/usr/lib64/autofs/mount_lustre.so: cannot open shared object file: No such file or directory)
Mar 25 12:26:37 rr447 kernel: Lustre: Client epsilon-client has started
Mar 25 12:26:37 rr447 kernel: LustreError: 2370:0:(file.c:993:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO


-- 
Stephen Willey
Senior Systems Engineer
Framestore
19-23 Wells Street, London W1T 3PQ
+44 207 344 8000
www.framestore.com 

/*
 * Immediately begin writing to a file on disk, to test Lustre
 */

#include <stdio.h>
#include <string.h>

#define DATA "kjrlewkujriojfjvclsdjfoiewujfdkjljvoisjvowjfelkjelwkjvfljifwedse"

int main(int argc, char *argv[])
{
	FILE *f;
	unsigned int times = 512;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
		return -1;
	}

	f = fopen(argv[1], "a"); /* creates if necessary */
	if (f == NULL) {
		perror("fopen");
		return -1;
	}

	while (times != 0) {
		if (fwrite(DATA, strlen(DATA), 1, f) != 1) {
			perror("fwrite");
			return -1;
		}
		times--;
	}

	if (fclose(f) != 0) {
		perror("fclose");
		return -1;
	}

	return 0;
}
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
