Dear Laurence,

Thank you for assisting us with our problems. We are still facing difficulties; I know it is not related to the code, but we still need help from experts in parallel compilation.
The code works just fine on a cluster where we compiled it using the extra '-assu buff' flag to alleviate NFS problems. However, on another cluster, which has a GPFS parallel file system and does not use NFS, the only problem we encounter is the following compile issue:

Compile time errors (if any) were:
SRC_vecpratt/compile.msg: signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
SRC_vecpratt/compile.msg: signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
Check file compile.msg in the corresponding SRC_* directory for the compilation log and more info on any compilation problem.

I'm not sure what the vecpratt part of the code actually does. When I look in the compile.msg file there are only warnings, not errors, and looking in the SRC_vecpratt directory I see that the executables are actually built. Where is vecpratt used by the program?

Thanks in advance,
Bothina

--- On Wed, 7/21/10, Laurence Marks <L-marks at northwestern.edu> wrote:

> From: Laurence Marks <L-marks at northwestern.edu>
> Subject: Re: [Wien] Problems in parallel jobs
> To: "A Mailing list for WIEN2k users" <wien at zeus.theochem.tuwien.ac.at>
> Date: Wednesday, July 21, 2010, 3:03 PM
>
> Also:
>
> 5) Use ldd of $WIENROOT/lapw1_mpi on different nodes to check that
> these are correct, and also "which mpirun".
>
> On Wed, Jul 21, 2010 at 6:47 AM, Laurence Marks
> <L-marks at northwestern.edu> wrote:
> > Hard to know for certain, but this looks like an OS problem rather than
> > a Wien2k issue. Things to check:
> >
> > 1) Use ompi_info and check, carefully, that the compilation
> > options/libraries used for openmpi are the same as what you are using
> > to compile Wien2k.
> >
> > 2) For 10.1, ulimit -s is not needed for mpi (and in any case does
> > nothing with openmpi) as this is done in software in Wien2kutils.c.
> > Make sure that you are exporting environment variables in your
> > mpirun call; for instance, use in parallel_options
> > setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_
> > -machinefile _HOSTS_ _EXEC_"
> >
> > 3) Check the size of the job you are running, e.g. via top, by looking
> > in case.output1_X, by using "lapw1 -p -nmat_only", using ganglia or
> > nmon, cat /proc/meminfo (or anything else you have available).
> > Particularly with openmpi, but with some other flavors as well, if you
> > are asking for too much memory and/or have too many processes running,
> > problems occur. A race condition can also occur in openmpi which makes
> > this problem worse (maybe patched in the latest version, I am not sure).
> >
> > 4) Check, carefully (twice), for format errors in the input files. It
> > turns out that ifort has its own signal traps, so a child can exit
> > without correctly calling mpi_abort. A race condition can occur with
> > openmpi when the parent is trying to find a child, the child does not
> > exist, and the parent waits then keeps going....
> >
> > 5) Check the OS logs in /var/log (beyond my competence). You may have
> > too high an nfs load, bad infiniband/myrinet (recent OFED?), etc. Use
> > -assu buff in compilation options to reduce nfs load.
> >
> > On Wed, Jul 21, 2010 at 3:53 AM, bothina hamad <both_hamad at yahoo.com> wrote:
> >> Dear Wien users,
> >>
> >> When running optimisation jobs under the torque queuing system for
> >> anything but very small systems:
> >>
> >> The job runs for many cycles using lapw0, lapw1, lapw2 (parallel)
> >> successfully, but eventually the 'mom-superior' node (the one that
> >> launches mpirun) stops communicating with the other nodes involved
> >> in the job.
> >>
> >> At the console of this node there is the correct load (4 for a quad
> >> processor) and free memory... but it can no longer access any nfs
> >> mounts and can no longer ping other nodes in the cluster...
> >> I am eventually forced to reboot the node and kill the job from the
> >> cluster queuing system (the job enters the 'E' state and stays
> >> there... I need to stop pbs_server, manually remove the job files
> >> from /var/spool/torque/server_priv/jobs, then restart pbs_server).
> >>
> >> A similar problem is encountered on a larger cluster (same install
> >> procedure), but with the added problem that the .dayfile reports that
> >> for lapw2 only the 'mom-superior' node is doing work (even though,
> >> logging into the other job nodes, top reports the correct load and
> >> 100% cpu use).
> >>
> >> DOS calculations seem to work properly on both clusters...
> >>
> >> We have used a modified x_lapw that you provided earlier.
> >> We have been inserting 'ulimit -s unlimited' into job scripts.
> >>
> >> We are using...
> >> CentOS 5.3 x86_64
> >> Intel compiler suite with mkl v11.1/072
> >> openmpi-1.4.2, compiled with Intel compilers
> >> fftw-2.1.5, compiled with Intel compilers and the openmpi above
> >> Wien2k v10.1
> >>
> >> Optimisation jobs for small systems complete OK on both clusters.
> >>
> >> The working directories for this job are large (>2GB).
> >>
> >> Please let us know what files we could send you from these that may
> >> be helpful for diagnosis...
> >>
> >> Best regards
> >> Bothina
> >>
> >> _______________________________________________
> >> Wien mailing list
> >> Wien at zeus.theochem.tuwien.ac.at
> >> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> >
> > --
> > Laurence Marks
> > Department of Materials Science and Engineering
> > MSE Rm 2036 Cook Hall
> > 2220 N Campus Drive
> > Northwestern University
> > Evanston, IL 60208, USA
> > Tel: (847) 491-3996 Fax: (847) 491-7820
> > email: L-marks at northwestern dot edu
> > Web: www.numis.northwestern.edu
> > Chair, Commission on Electron Crystallography of IUCR
> > www.numis.northwestern.edu/
> > Electron crystallography is the branch of science that uses electron
> > scattering and imaging to study the structure of matter.
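[Editor's note on item 2 of Laurence's checklist: the environment-forwarding mpirun line from parallel_options can be composed and inspected outside of WIEN2k. The following is a minimal POSIX-shell sketch; the function name and the example paths are illustrative assumptions, not part of WIEN2k itself.]

```shell
# Hypothetical helper: compose an openmpi launch command the way the
# quoted parallel_options line does, forwarding LD_LIBRARY_PATH and PATH
# with -x so that remote ranks see the same libraries as the launching node.
build_wien_mpirun() {
  np=$1     # number of MPI processes (the _NP_ placeholder)
  hosts=$2  # machinefile (the _HOSTS_ placeholder)
  exe=$3    # executable (the _EXEC_ placeholder), e.g. $WIENROOT/lapw1_mpi
  echo "mpirun -x LD_LIBRARY_PATH -x PATH -np $np -machinefile $hosts $exe"
}

# Example: print the command that would be launched for 4 processes
# (illustrative path; substitute your own $WIENROOT).
build_wien_mpirun 4 .machines /opt/wien2k/lapw1_mpi
```

Running `ldd` on the printed executable path on each node, plus `which mpirun`, then covers the consistency check in item 5.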