Yes, I agree that it is an issue related to the compilation; I said as much in my second email of 23 November. Still, I think it is worth having this problem reported and "solved" (at least in practice) on the forum. I ran into this error at an HPC center, on a build done by specialists, so other users may well experience it too.
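For what it's worth, here is the workaround in concrete form (the details are in the thread quoted below). With Intel compilers, adding "-fp-model precise" to the Fortran flags of the build avoided the crashes for us; with GNU compilers no extra flag was needed (the underflow is still flagged with a note in the job file after completion, as mentioned below, but the run is not affected). A minimal sketch of the change, assuming an ifort build and QE's usual make.sys/make.inc layout; the other option values are only illustrative, so adapt them to whatever your configure produced:

    # Only "-fp-model precise" is the relevant addition: it makes ifort use
    # value-safe floating-point optimizations instead of its default fast model.
    # F90FLAGS normally picks this up via $(FFLAGS) in the same file.
    FFLAGS = -O2 -assume byterecl -g -traceback -fp-model precise

After editing the file, recompile (e.g. "make clean; make pw") so that the new flags are actually used.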
2016-12-01 18:43 GMT+01:00 Paolo Giannozzi <p.gianno...@gmail.com>:

> "underflows"? They should never be a problem, unless you instruct the
> compiler (by activating some obscure flag) to catch them.
>
> Paolo
>
> On Thu, Dec 1, 2016 at 4:48 PM, Sergi Vela <sergi.v...@gmail.com> wrote:
>
>> Dear Paolo,
>>
>> I have some more details on the problem with DFT+U. The problem arises
>> from underflows somewhere in the QE code, hence the MPI_Bcast message
>> described in previous emails. A systematic crash occurs for the attached
>> input, at least in versions 5.1.1, 5.2, 5.4 and 6.0.
>>
>> According to the support team of HPC-GRNET, the problem is not related
>> to MPI (no matter whether IntelMPI or OpenMPI is used; various versions
>> of both were tried), and it is not related to the BLAS libraries (MKL,
>> OpenBLAS). For Intel compilers, the flag "-fp-model precise" seems to be
>> necessary (at least for 5.2 and 5.4). GNU compilers, in turn, work: they
>> also notice the underflow (a message appears in the job file after
>> completion), but it seems that they can handle it.
>>
>> The attached input is just an example. Many other jobs on different
>> systems have failed, whereas other closely related inputs have run
>> without any problem. I have the impression that the underflow is not
>> always occurring or, at least, is not always enough to crash the job.
>>
>> Right now I'm extensively using version 5.1.1 compiled with the GNU/4.9
>> compiler and it seems to work well.
>>
>> That's all the information I can give you about the problem. I hope it
>> may eventually help.
>>
>> Bests,
>> Sergi
>>
>> 2016-11-23 16:13 GMT+01:00 Sergi Vela <sergi.v...@gmail.com>:
>>
>>> Dear Paolo,
>>>
>>> Unfortunately, there's not much to report so far. Many "relax" jobs for
>>> a system of ca. 500 atoms (including Fe) fail with the same message
>>> Davide reported a long time ago:
>>> _________________
>>>
>>> Fatal error in PMPI_Bcast: Other MPI error, error stack:
>>> PMPI_Bcast(2434)........: MPI_Bcast(buf=0x8b25e30, count=7220,
>>> MPI_DOUBLE_PRECISION, root=0, comm=0x84000007) failed
>>> MPIR_Bcast_impl(1807)...:
>>> MPIR_Bcast(1835)........:
>>> I_MPIR_Bcast_intra(2016): Failure during collective
>>> MPIR_Bcast_intra(1665)..: Failure during collective
>>> _________________
>>>
>>> It only occurs on some architectures. The same inputs work for me on
>>> two other machines, so it seems to be related to the compilation. The
>>> support team of the HPC center I'm working at is trying to identify the
>>> problem. Also, it seems to occur randomly, in the sense that some DFT+U
>>> calculations of the same type (same cutoffs, pseudopotentials, system)
>>> show no problem at all.
>>>
>>> I'll try to be more helpful next time, and I'll keep you updated.
>>>
>>> Bests,
>>> Sergi
>>>
>>> 2016-11-23 15:21 GMT+01:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>>>
>>>> Thank you, but unless an example demonstrating the problem is
>>>> provided, or at least some information on where this message comes
>>>> from is supplied, there is close to nothing that can be done.
>>>>
>>>> Paolo
>>>>
>>>> On Wed, Nov 23, 2016 at 10:05 AM, Sergi Vela <sergi.v...@gmail.com>
>>>> wrote:
>>>>
>>>>> Dear Colleagues,
>>>>>
>>>>> Just to report that I'm having exactly the same problem with DFT+U.
>>>>> The same message appears randomly, and only when I use the Hubbard
>>>>> term. I could test versions 5.2 and 6.0 and it occurs in both.
>>>>>
>>>>> All my best,
>>>>> Sergi
>>>>>
>>>>> 2015-07-16 18:43 GMT+02:00 Paolo Giannozzi <p.gianno...@gmail.com>:
>>>>>
>>>>>> There are many well-known problems of DFT+U, but none that is known
>>>>>> to crash jobs with an obscure message.
>>>>>>
>>>>>>> Rank 21 [Thu Jul 16 15:51:04 2015] [c4-2c0s15n2] Fatal error in
>>>>>>> PMPI_Bcast: Message truncated, error stack:
>>>>>>> PMPI_Bcast(1615)..................: MPI_Bcast(buf=0x75265e0,
>>>>>>> count=160, MPI_DOUBLE_PRECISION, root=0, comm=0xc4000000) failed
>>>>>>
>>>>>> This signals a mismatch between what is sent and what is received in
>>>>>> a broadcast operation. This may be due to an obvious bug, which
>>>>>> however should show up at the first iteration, not after XX. Apart
>>>>>> from compiler or MPI library bugs, another reason is the one
>>>>>> described in sec. 8.3 of the developer manual: different processes
>>>>>> following different execution paths. From time to time, cases like
>>>>>> this are found (the latest occurrence was in the band parallelization
>>>>>> of exact exchange) and easily fixed. Unfortunately, finding them
>>>>>> (that is: where this happens) typically requires painstaking parallel
>>>>>> debugging.
>>>>>>
>>>>>> Paolo
>>>>>> --
>>>>>> Paolo Giannozzi, Dept. Chemistry&Physics&Environment,
>>>>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>>>>> Phone +39-0432-558216, fax +39-0432-558222
>>>>
>>>> --
>>>> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
>>>> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
>>>> Phone +39-0432-558216, fax +39-0432-558222
>
> --
> Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
> Univ. Udine, via delle Scienze 208, 33100 Udine, Italy
> Phone +39-0432-558216, fax +39-0432-558222
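P.S. To make Paolo's point above concrete: the "Message truncated" error appears when the root broadcasts more data than some receiving rank has posted, which is exactly what happens when processes follow different execution paths and disagree on a buffer size. A toy sketch of my own (not QE code; the counts and the scenario are invented for illustration):

    ! bcast_mismatch.f90 -- minimal illustration, not taken from QE.
    ! Rank 1 follows a different execution path and posts a smaller buffer
    ! than the root broadcasts; MPICH-based libraries (Intel MPI, Cray MPI)
    ! then typically abort on that rank with "Fatal error in PMPI_Bcast:
    ! Message truncated" (MPI_ERR_TRUNCATE), much like the stacks quoted above.
    program bcast_mismatch
      use mpi
      implicit none
      integer :: ierr, rank, cnt
      double precision, allocatable :: buf(:)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      cnt = 160                   ! size as computed on the root
      if (rank == 1) cnt = 80     ! divergent path: rank 1 disagrees

      allocate (buf(cnt))
      buf = 0.0d0

      ! Erroneous per the MPI standard: the type signature (here, the count)
      ! must match between the root and every receiving process.
      call MPI_Bcast(buf, cnt, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

      deallocate (buf)
      call MPI_Finalize(ierr)
    end program bcast_mismatch

Run it on two or more ranks (e.g. "mpirun -np 2 ./bcast_mismatch") to see the error. In pw.x the broadcast sizes are of course computed, not hard-coded, so a mismatch like this only shows up when something upstream (in our case, possibly the mishandled underflows) makes the ranks diverge; that is also why it appears only on some machines and only for some inputs.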
_______________________________________________
Pw_forum mailing list
Pw_forum@pwscf.org
http://pwscf.org/mailman/listinfo/pw_forum