Hi Jagga,

On Wed, Oct 13, 2010 at 02:33:35PM -0700, Jagga Soorma wrote:
..
>start seeing this issue. All my clients are setup with SLES11 and the same
>packages with the exception of a newer kernel in the 1.8.4 environment due
>to the lustre dependency:
>
>reshpc208:~ # uname -a
>Linux reshpc208 2.6.27.39-0.3-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64
>x86_64 x86_64 GNU/Linux
...
>open("/proc/9598/stat", O_RDONLY) = 6
>read(6, "9598 (gsnap) S 9596 9589 9589 0 "..., 1023) = 254
>close(6) = 0
>open("/proc/9598/status", O_RDONLY) = 6
>read(6, "Name:\tgsnap\nState:\tS (sleeping)\n"..., 1023) = 1023
>close(6) = 0
>open("/proc/9598/cmdline", O_RDONLY) = 6
>read(6,
did you get any further with this?

we've just seen something similar, in that we had D state hung processes and a strace of ps hung at the same place.

in the end our hang appeared to be /dev/shm related, and an 'ipcs -ma' magically caused all the D state processes to continue... we don't have a good idea why this might be. it looks kinda like a generic kernel shm deadlock, possibly unrelated to Lustre. sys_shmdt features in the hung process tracebacks that the kernel prints out.

if you run 'lsof', do you see lots of /dev/shm entries for your app? the app we saw run into trouble was using HPMPI, which is common in commercial packages. does gsnap use HPMPI?

we are running vanilla 2.6.32.* kernels with Lustre 1.8.4 clients on this cluster.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
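for anyone landing here with the same symptom, the checks described above can be sketched as a few shell commands. this is just a hedged summary of the diagnostics mentioned in the thread (find D-state tasks, look for open /dev/shm files, inspect SysV shm segments); the exact output will of course depend on your system, and lsof may not be installed everywhere:

```shell
#!/bin/sh
# list processes stuck in uninterruptible sleep (state begins with "D")
ps -eo pid,stat,comm | awk '$2 ~ /^D/ {print}'

# see which processes hold /dev/shm-backed files open
# (an MPI app, like the HPMPI one mentioned above, will typically show several)
lsof /dev/shm 2>/dev/null || true

# dump all SysV shared memory segments with attach details;
# in the incident described above, running this unblocked the D-state tasks
ipcs -ma
```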