On 2009-11-26, at 12:40, Goranka Bilalbegovic wrote:
> Recently the cluster I am using for computing has been updated to
> the VMware with the Lustre file system. Cluster uses: Oscar 6.0.3,
> Sun Grid Engine 6.2u3, Nagios, Ganglia, InfiniBand 10 Gb/s. Nodes
> access the file system using Ethernet via the Lustre InfiniBand/
> Ethernet router.
>
> I used to run one type of jobs as:
> ---
> #$ -N name
> #$ -o namesys.out
> #$ -e namesys.err
> #$ -pe mpi 2
> #$ -cwd
> #$ -v LD_LIBRARY_PATH
> mpirun -machinefile $TMPDIR/machines -np $NSLOTS /path/.../code.x << EOF
> name.in
> name.out
> EOF
> ---
>
> This is for an open source package (written in Fortran plus some C
> utilities) and a such way of running was recommended by authors. It
> was working on the previous version of the cluster, but it does not
> run on a new lustre filesystem. It starts, but then stays in the
> queue forever.
Without more information it is impossible to know what the problem is.
There shouldn't be any problem with running executables from Lustre.

General debugging steps that should be followed (not strictly related
to this problem):
- presumably the Lustre filesystem is accessible from within your VM
  and is working fine other than this job launch problem?
- try to run the job by hand to see if it really is a Lustre problem
  or if it is related to the batch scheduler or something else
- check /var/log/messages to see if there are Lustre (or other) errors
- do "echo t > /proc/sysrq-trigger" to dump the stacks of all processes
  on the system, and see where your job is stuck

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
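
The debugging steps listed in the reply could be scripted roughly as
follows. This is only a sketch: the function names (lustre_mounts,
lustre_errors, dump_stacks) are made up for illustration, the grep
pattern assumes that Lustre kernel messages carry a "Lustre" or "LNet"
prefix, and the stack dump needs root with sysrq enabled.

```shell
#!/bin/sh
# Hypothetical helpers for the checks above; names are illustrative only.

# Print the mount points of all Lustre filesystems found in a mounts
# table (defaults to /proc/mounts; field 3 is the filesystem type).
lustre_mounts() {
    awk '$3 == "lustre" { print $2 }' "${1:-/proc/mounts}"
}

# Print the last 50 Lustre/LNet-related lines from a syslog file
# (defaults to /var/log/messages). The pattern is an assumption.
lustre_errors() {
    grep -E 'Lustre|LNet' "${1:-/var/log/messages}" | tail -n 50
}

# Dump the stacks of all tasks into the kernel log, then show the tail
# of that log. Must be run as root.
dump_stacks() {
    echo t > /proc/sysrq-trigger && dmesg | tail -n 200
}

# Example use on a compute node (not executed here):
#   lustre_mounts            # is Lustre mounted at all?
#   lustre_errors            # any recent Lustre errors?
#   dump_stacks              # where is the stuck job sleeping?
```

Running the mpirun line from the job script in an interactive shell on
one of the nodes, with a hand-written machinefile in place of
$TMPDIR/machines, covers the "run the job by hand" step and separates
Lustre problems from scheduler problems.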
