Try mounting the Lustre filesystem with -o flock or -o localflock.
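
To tell whether a remount changes anything, it may help to reproduce the symptom outside R first. The strace in the quoted message shows fstat() returning st_size=0 for a file that then reads back a valid HDF5 signature, so one rough check is to compare the two. This is only a sketch in Python: the function names `size_mismatch` and `looks_stale` are made up for illustration, and it assumes the zero size reported by fstat() is what leads HDF5 to report a truncated file, which the original report does not confirm.

```python
import os

# The 8-byte HDF5 signature, as seen in the strace read() call
# ("\211HDF\r\n\32\n" in octal escapes).
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def size_mismatch(path):
    """Return (reported_size, bytes_read, has_signature) for one file.

    Mirrors the syscall order in the strace: open, fstat, then read.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        reported = os.fstat(fd).st_size
        head = os.read(fd, len(HDF5_SIGNATURE))
    finally:
        os.close(fd)
    return reported, len(head), head == HDF5_SIGNATURE


def looks_stale(path):
    """True when fstat reports fewer bytes than a read actually returned."""
    reported, nread, _ = size_mismatch(path)
    return reported < nread
```

Running `looks_stale()` over the directory before and after remounting (and before any 'ls') should show whether the reported size is the part that changes.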
On Thu, May 5, 2011 at 4:47 AM, Christopher Walker <[email protected]> wrote:
> Hello,
>
> We have a user who is trying to post-process HDF files in R. Her script
> goes through a number (~2500) of files in a directory, opening and
> reading the contents. This usually goes fine, but occasionally the
> script dies with:
>
>
> HDF5-DIAG: Error detected in HDF5 (1.9.4) thread 46944713368080:
>   #000: H5F.c line 1560 in H5Fopen(): unable to open file
>     major: File accessability
>     minor: Unable to open file
>   #001: H5F.c line 1337 in H5F_open(): unable to read superblock
>     major: File accessability
>     minor: Read failed
>   #002: H5Fsuper.c line 542 in H5F_super_read(): truncated file
>     major: File accessability
>     minor: File has been truncated
> Error in hdf5load(file = myfile, load = FALSE, verbosity = 0, tidy =
> TRUE) :
>   unable to open HDF file:
> /n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5
> HDF5-DIAG: Error detected in HDF5 (1.9.4) thread 46944713368080:
>   #000: H5F.c line 2012 in H5Fclose(): decrementing file ID failed
>     major: Object atom
>     minor: Unable to close file
>   #001: H5I.c line 1340 in H5I_dec_ref(): can't locate ID
>     major: Object atom
>     minor: Unable to find atom information (already closed?)
> Error in hdf5cleanup(16778754L) : unable to close HDF file
>
>
> But this file definitely does exist -- any stat or ls command shows it
> without a problem. Further, once I 'ls' this file, if I rerun the same
> script, it successfully reads this file, but then dies on the next one
> with the same error. If I 'ls' the entire directory, the script runs to
> completion without a problem. strace output shows:
>
> open("/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5",
> O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> lseek(3, 0, SEEK_SET) = 0
> read(3, "\211HDF\r\n\32\n", 8) = 8
> read(3, "\0", 1) = 1
> read(3,
> "\0\0\0\0\10\10\0\4\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377@"...,
> 87) = 87
> close(3) = 0
> write(2, "HDF5-DIAG: Error detected in HDF"..., 42) = 42
> etc
>
> which initially looks fine to me, followed by an abrupt close.
>
> NFS filesystems and our 1.6.7.2 filesystem have no such problems -- any
> suggestions?
>
> Thanks very much,
> Chris
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
