Dear Paolo,

I have some more details on the problem with DFT+U. The problem arises from underflows somewhere in the QE code, hence the MPI_Bcast message described in previous emails. A systematic crash occurs for the attached input in at least versions 5.1.1, 5.2, 5.4 and 6.0.
According to the support team at HPC-GRNET, the problem is not related to MPI (it appears with both IntelMPI and OpenMPI, in various versions of each) and is not related to the BLAS libraries (MKL, OpenBLAS). For the Intel compilers, the flag "-fp-model precise" seems to be necessary (at least for 5.2 and 5.4). The GNU compilers, in turn, work: they also notice the underflow (a message appears in the job file after completion), but it seems they can handle it.

The attached input is just an example. Many other jobs on different systems have failed, whereas other closely related inputs have run without any problem. I have the impression that the underflow does not always occur or, at least, is not always enough to crash the job. Right now I'm extensively using version 5.1.1 compiled with the GNU 4.9 compiler, and it seems to work well.

That's all the information I can give you about the problem. I hope it eventually helps.

Bests,
Sergi

2016-11-23 16:13 GMT+01:00 Sergi Vela <sergi.v...@gmail.com>:

> Dear Paolo,
>
> Unfortunately, there's not much to report so far. Many "relax" jobs for a
> system of ca. 500 atoms (including Fe) fail with the same message Davide
> reported a long time ago:
> _________________
>
> Fatal error in PMPI_Bcast: Other MPI error, error stack:
> PMPI_Bcast(2434)........: MPI_Bcast(buf=0x8b25e30, count=7220,
> MPI_DOUBLE_PRECISION, root=0, comm=0x84000007) failed
> MPIR_Bcast_impl(1807)...:
> MPIR_Bcast(1835)........:
> I_MPIR_Bcast_intra(2016): Failure during collective
> MPIR_Bcast_intra(1665)..: Failure during collective
> _________________
>
> It only occurs on some architectures. The same inputs work for me on two
> other machines, so it seems to be related to the compilation. The support
> team of the HPC center I work at is trying to identify the problem.
> It also seems to occur randomly, in the sense that some DFT+U
> calculations of the same type (same cutoffs, pseudopotentials, system) show no
> problem at all.
> I'll try to be more helpful next time, and I'll keep you updated.
>
> Bests,
> Sergi
>
> 2016-11-23 15:21 GMT+01:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>
>> Thank you, but unless an example demonstrating the problem is provided,
>> or at least some information on where this message comes from is supplied,
>> there is close to nothing that can be done.
>>
>> Paolo
>>
>> On Wed, Nov 23, 2016 at 10:05 AM, Sergi Vela <sergi.v...@gmail.com> wrote:
>>
>>> Dear Colleagues,
>>>
>>> Just to report that I'm having exactly the same problem with DFT+U. The
>>> same message appears randomly, but only when I use the Hubbard term. I
>>> could test versions 5.2 and 6.0, and it occurs in both.
>>>
>>> All my best,
>>> Sergi
>>>
>>> 2015-07-16 18:43 GMT+02:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>>>
>>>> There are many well-known problems with DFT+U, but none that is known to
>>>> crash jobs with an obscure message.
>>>>
>>>>> Rank 21 [Thu Jul 16 15:51:04 2015] [c4-2c0s15n2] Fatal error in
>>>>> PMPI_Bcast: Message truncated, error stack:
>>>>> PMPI_Bcast(1615)..................: MPI_Bcast(buf=0x75265e0,
>>>>> count=160, MPI_DOUBLE_PRECISION, root=0, comm=0xc4000000) failed
>>>>
>>>> This signals a mismatch between what is sent and what is received in a
>>>> broadcast operation. It may be due to an obvious bug, which, however, should
>>>> show up at the first iteration, not after XX. Apart from compiler or
>>>> MPI-library bugs, another cause is the one described in sec. 8.3 of the
>>>> developer manual: different processes following different execution paths.
>>>> From time to time, cases like this are found (the latest occurrence was in
>>>> the band parallelization of exact exchange) and easily fixed. Unfortunately,
>>>> finding them (that is: where this happens) typically requires painstaking
>>>> parallel debugging.
>>>>
>>>> Paolo
>>>> --
>>>> Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
>>>> Univ.
>>>> Udine, via delle Scienze 208, 33100 Udine, Italy
>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>
>>>> _______________________________________________
>>>> Pw_forum mailing list
>>>> Pw_forum@pwscf.org
>>>> http://pwscf.org/mailman/listinfo/pw_forum
>>
>> --
>> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>> Phone +39-0432-558216, fax +39-0432-558222