[Wien] Problems in parallel jobs

2010-07-27 Thread bothina hamad
Dear Laurence,
 Thank you for assisting us with our problems. We are still facing 
issues; I know they are not related to the code itself, but we still need help 
from experts in parallel compilation.

The code works just fine on a cluster where we have compiled it using the extra 
'-assu buff' flag to alleviate NFS problems. 
However, on another cluster that has a GPFS parallel file system and does not 
use NFS, the only problem that we encounter is the following compile issue:
 
 
 
 Compile time errors (if any) were:
 SRC_vecpratt/compile.msg:          signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
 SRC_vecpratt/compile.msg:          signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
 
 
 Check file   compile.msg   in the corresponding SRC_* directory for the 
 compilation log and more info on any compilation problem.
 
 
 I'm not sure what the vecpratt part of the code actually does.
 
 When I look in the compile.msg file there are only warnings, not errors, and 
looking in the SRC_vecpratt directory I see that the executables are actually 
built. 
 
Where is vecpratt used by the program?


Thanks in advance
Bothina 


[Wien] Problems in parallel jobs

2010-07-27 Thread Laurence Marks
This is not an error, just something untidy in the code (which should
be cleaned up).

The compilation script does a grep -e Error on the file compile.msg,
which sometimes (depending on which C flags are used) contains a warning
for the two lines you mentioned. The grep picks up the material between
the /* and */, which is a C comment, and prints it in a fashion that is
misleading.

N.B., if you need to fix this, just remove the word Error from
SRC_vecpratt/W2kutils.c so that the relevant line reads
signal ( SIGBUS, w2ksignal_bus ); /* Bus  */
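
In other words, the check boils down to something like the following (the
exact grep invocation in the compile script may differ slightly; this is
only meant to show why the comment gets picked up):

  # roughly what the post-compile check runs
  grep -e Error SRC_*/compile.msg
  # this matches the harmless C comment
  #     signal ( SIGBUS, w2ksignal_bus ); /* Bus Error */
  # as well as real compiler errors; once the word Error is removed from
  # that comment in W2kutils.c, only genuine errors are reported.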


[Wien] Problems in parallel jobs

2010-07-21 Thread bothina hamad
Dear Wien users,

When running optimisation jobs under torque queuing system for anything but
very small systems:

The job runs for many cycles using lapw0, lapw1, lapw2 (parallel) successfully, 
but eventually the 'mom-superior' node (the one that launches mpirun) stops 
communicating with the other nodes involved in the job.

At the console of this node the load is correct (4 for a quad processor) and 
there is free memory, but the node can no longer access any NFS mounts and can 
no longer ping other nodes in the cluster. I am eventually forced to reboot the 
node and kill the job from the cluster queuing system (the job enters the 'E' 
state and stays there; I need to stop pbs_server, manually remove the job files 
from /var/spool/torque/server_priv/jobs, and then restart pbs_server).
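
In command form, the cleanup we end up doing looks roughly like this (the job
id is just a placeholder and the init-script path is an assumption about a
stock torque install, so adjust for your site):

  /etc/init.d/pbs_server stop
  # remove the files belonging to the stuck job (placeholder job id)
  rm /var/spool/torque/server_priv/jobs/1234.headnode.*
  /etc/init.d/pbs_server start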

A similar problem is encountered on a larger cluster (same install procedure), 
but with the added problem that the .dayfile reports that for lapw2 only the 
'mom-superior' node is doing work (even though, when logging into the other job 
nodes, top reports the correct load and 100% CPU use).

DOS calculation seems to work properly on both clusters...

We have used a modified x_lapw that you provided earlier.
We have been inserting 'ulimit -s unlimited' into the job scripts.

We are using...
CentOS 5.3 x86_64
Intel compiler suite with MKL v11.1/072
OpenMPI 1.4.2, compiled with the Intel compilers
FFTW 2.1.5, compiled with the Intel compilers and the OpenMPI above
WIEN2k v10.1

Optimisation jobs for small systems complete OK on both clusters.

The working directories for this job are large (2GB).

Please let us know which files from these directories we could send you that 
might be helpful for diagnosis.

Best regards
Bothina


  


[Wien] Problems in parallel jobs

2010-07-21 Thread Laurence Marks
Hard to know for certain, but this looks like an OS problem rather than
a Wien2k issue. Things to check:

1) Use ompi_info and check, carefully, that the compilation
options/libraries used for openmpi are the same as what you are using
to compile Wien2k.

2) For 10.1, ulimit -s is not needed for mpi (and in any case does
nothing with openmpi), as this is done in software in Wien2kutils.c.
Make sure that you are exporting environment variables in your
mpirun call; for instance, in parallel_options use
setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"
(see the command sketch after point 5 below).

3) Check the size of the job you are running, e.g. via top, by looking
in case.output1_X, by using lapw1 -p -nmat_only, by using ganglia or
nmon, or with cat /proc/meminfo (or anything else you have available).
Particularly with openmpi, but with some other flavors as well, if you
are asking for too much memory and/or have too many processes running,
problems occur. A race condition can also occur in openmpi which makes
this problem worse (maybe patched in the latest version, I am not sure).

4) Check, carefully (twice), for format errors in the input files. It
turns out that ifort has its own signal traps, so a child can exit
without correctly calling mpi_abort. A race condition can occur with
openmpi when the parent is trying to find a child that no longer
exists: the parent waits and then keeps going.

5) Check the OS logs in /var/log (beyond my competence). You may have
too high an NFS load, bad infiniband/myrinet (recent OFED?), etc. Use
-assu buff in the compilation options to reduce the NFS load.
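
To make points 1-3 concrete, a minimal sketch in command form (csh syntax, as
used by WIEN2k; the x-script invocation for -nmat_only is written as I
understand it from the description above, so verify it against your own
installation):

  # point 1: check which compilers were used to build openmpi
  ompi_info | grep -i compiler

  # point 2: the parallel_options line quoted above, with the multi-word
  # value in quotes so csh passes it through intact
  setenv WIEN_MPIRUN "mpirun -x LD_LIBRARY_PATH -x PATH -np _NP_ -machinefile _HOSTS_ _EXEC_"

  # point 3: estimate the matrix size without running the diagonalization,
  # then compare against the free memory on a compute node
  x lapw1 -p -nmat_only
  grep MemFree /proc/meminfo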





-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.


[Wien] Problems in parallel jobs

2010-07-21 Thread Laurence Marks
Also:

5) Use ldd on $WIENROOT/lapw1_mpi on different nodes to check that the
linked libraries are correct, and also check which mpirun is being picked up.
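
A minimal sketch of that check, assuming passwordless ssh and that WIEN2k is
installed under the same $WIENROOT path on every node (the node names are
placeholders):

  # csh loop; $WIENROOT is expanded on the submitting host
  foreach host ( node01 node02 node03 )
    ssh $host "ldd $WIENROOT/lapw1_mpi ; which mpirun"
  end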


-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Electron crystallography is the branch of science that uses electron
scattering and imaging to study the structure of matter.