We're seeing errors which we believe are down to automount returning too early from a Lustre mount.
We're using autofs, so the Lustre filesystem may be mounted immediately before the command that uses it is run. We believe the errors occur because the client has not yet established connections to all the OSTs when mount returns and the following command runs.

We've tried creating an automounter module based on mount_generic that simply puts a 1s delay in the mount. That has reduced the number of errors, but they're very much still there. Putting in a larger delay is an option, but fairly obviously a pretty bad one. Once the filesystem is actually mounted, things work properly, at least until the automounter drops the mount again.

Pasted below are two example log excerpts where we've automounted a filesystem called /net/epsilon, then immediately tried to fopen() a file on it, which gives an I/O error. I've attached a tiny C program that can regularly replicate the issue (it happened on 16 machines when run with pdsh across a set of roughly 400, and this is fairly representative).

Any ideas or recommendations would be much appreciated,

Stephen

Mar 25 12:26:38 rr445 automount[6457]: open_mount: (mount):cannot open mount module lustre (/usr/lib64/autofs/mount_lustre.so: cannot open shared object file: No such file or directory)
Mar 25 12:26:38 rr445 kernel: Lustre: Client epsilon-client has started
Mar 25 12:26:38 rr445 kernel: LustreError: 22600:0:(file.c:993:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

Mar 25 12:26:37 rr447 automount[6458]: open_mount: (mount):cannot open mount module lustre (/usr/lib64/autofs/mount_lustre.so: cannot open shared object file: No such file or directory)
Mar 25 12:26:37 rr447 kernel: Lustre: Client epsilon-client has started
Mar 25 12:26:37 rr447 kernel: LustreError: 2370:0:(file.c:993:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

--
Stephen Willey
Senior Systems Engineer
Framestore
19-23 Wells Street, London W1T 3PQ
+44 207 344 8000
www.framestore.com
/*
 * Immediately begin writing to a file on disk, to test Lustre
 */

#include <stdio.h>
#include <string.h>

#define DATA "kjrlewkujriojfjvclsdjfoiewujfdkjljvoisjvowjfelkjelwkjvfljifwedse"

int main(int argc, char *argv[])
{
    FILE *f;
    unsigned int times = 512;

    if (argc != 2) {
        fprintf(stderr, "Usage: %s <pathname>\n", argv[0]);
        return -1;
    }

    f = fopen(argv[1], "a"); /* creates if necessary */
    if (f == NULL) {
        perror("fopen");
        return -1;
    }

    /* Append the test string 512 times */
    while (times != 0) {
        if (fwrite(DATA, strlen(DATA), 1, f) != 1) {
            perror("fwrite");
            return -1;
        }
        times--;
    }

    if (fclose(f) != 0) {
        perror("fclose");
        return -1;
    }

    return 0;
}
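
For comparison with the fixed 1s delay in the mount module, here is a minimal sketch of an application-side stopgap: retry the fopen() a few times when it fails with EIO, on the assumption that the error is transient while the client is still connecting to the OSTs. The helper name fopen_retry_eio and its attempts/delay_ms parameters are hypothetical and not part of Lustre or autofs. It only masks the symptom and does not address why mount returns early.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() under strict -std modes */
#endif

#include <errno.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper (not a Lustre or autofs API): call fopen() and, if it
 * fails with EIO, sleep delay_ms milliseconds and try again, up to
 * `attempts` tries in total. Any other error is returned immediately with
 * errno left as fopen() set it.
 */
FILE *fopen_retry_eio(const char *path, const char *mode,
                      unsigned int attempts, unsigned int delay_ms)
{
    struct timespec delay;
    FILE *f;
    unsigned int i;

    if (attempts == 0)
        attempts = 1; /* always try at least once */

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (long)(delay_ms % 1000) * 1000000L;

    for (i = 0; i < attempts; i++) {
        f = fopen(path, mode);
        if (f != NULL || errno != EIO)
            return f; /* success, or an error that retrying won't help */
        if (i + 1 < attempts)
            nanosleep(&delay, NULL); /* wait before the next try */
    }

    errno = EIO; /* nanosleep() may have overwritten errno */
    return NULL;
}
```

In the test program above, the call fopen(argv[1], "a") could be swapped for something like fopen_retry_eio(argv[1], "a", 5, 200) to see whether the error clears within about a second of the mount. The retry count and delay here are arbitrary starting values, not tuned figures.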
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss