Re: [Wien] ‘lapw2 -so’ hangs
Hi,

Regarding my original problem: it disappeared after another “lapw0; lapw1; lapwso” cycle, where this time I ran ‘clean’ first. I guess something must have been left in an inconsistent state from previous calculations in that directory, and ‘clean’ removed the offending file.

Elias
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at: http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
Re: [Wien] ‘lapw2 -so’ hangs
Dear Peter,

I have tried to narrow things down a bit. The subroutine ‘fermi_tetra’ gets stuck in the loop labeled ‘14’. Here is a code snippet:

    498     4 K=K+1
    499       if(iloop0.ne.0) KPP(ILOOP0)=K
    500 !para begin
    501 ! testing
    502 !      write(*,*)'reading k=',K,itap,ispin,iloop,iloop0
    503 !para end
    504       IF(K.GT.2*NKPT) GOTO 900
    505       READ(ITAP,5001,END=999) SS,T,ZZ,KNAME,N,NEHELP(K,ispin),WEI
    506       nmat=MAX(n,nmat)
    507
    508 !para begin
    509       NE(K+k1)=NEHELP(K,ispin)
    510 !para end
    511       if(nehelp(k,ispin).gt.nume) GOTO 920
    512       if(nemax.lt.nehelp(k,ispin)) nemax=nehelp(k,ispin)
    513       IF(N.GT.MAXWAV) MAXWAV=N
    514       IF(N.LT.MINWAV) MINWAV=N
    515    14 READ(ITAP,*) NUM,E1
    516       Eb(num,K,ispin)=E1
    517       if(itap.eq.30.and.(e1.gt.ebmax(num))) ebmax(num)=e1
    518       if(itap.eq.30.and.e1.lt.ebmin(num)) ebmin(num)=e1
    519 !     READ(ITAP) (A(I),I=1,N)
    520       IF(NUM.EQ.NEHELP(K,ispin)) GOTO 4
    521       GOTO 14

I put a debug statement

    write(0,*) 'Hello ', k,ispin, nehelp(k,ispin), nume

before l. 511. The last few lines of output look either like this:

    Hello 2434 2 54 60
    Hello 2435 2 54 60
    Hello 2436 2 56 60
    Hello 2437 2 0 60
    Hello 2438 2 0 60
    Hello 2439 2 1198992928 60
    FERMI - Error

where an error is raised on l. 511, or like this:

    Hello 2434 2 54 60
    Hello 2435 2 54 60
    Hello 2436 2 56 60
    Hello 2437 2 0 60
    Hello 2438 2 0 60
    Hello 2439 2 -820289632 60

where the program goes into the infinite loop instead.

What happens is that the NEHELP array is too small, so the READ on l. 505 fails and NEHELP(K,ispin) ends up containing uninitialized data. So I guess the problem stems from the ‘energysodn’ which is too small, and I need to go look at what is going wrong in lapwso. But I thought I should share this anyway.

In particular, I do not understand what is going on with the SIGSEGVs the program gets. They would be caused by NEHELP being too small, but why doesn't the program die? The Wien2k signal handling is not invoked (since this is not parallel); I do see a call

    rt_sigaction(SIGSEGV, {0x4d2480, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x2b472c1f5ca0}, NULL, 8) = 0

in the trace, but ifort seems to do this even for the simplest test program, and that does not prevent it from dying on a SIGSEGV.

Secondly, I thought the READ on l. 515 would raise an error on EOF; instead it seems to “busy wait” (it never returns but keeps the CPU usage at 100%).

What is more, I found that the problem interacts in a subtle way with ifort's (V. 11.1) ‘-ipo’ and ‘-g’ switches. Originally, I had ‘-ipo’ set, resulting in the infinite loop. For debugging, I took that out and added ‘-g’ instead, which resulted in the behavior described above. Turning ‘-ipo’ back on, the debug output looks like this:

    Hello 2434 2 54 60
    Hello 2435 2 54 60
    Hello 2436 2 56 60
    Hello 2437 2 56 60

and the infinite loop always happens. When I use neither switch, the result is what I would normally expect: the program dies from the segfault.

Summarizing, this is what I see:

    -g -ipo   : silent fail
    -ipo      : silent fail
    -g        : “FERMI - Error” / silent fail
    (neither) : “normal” segfault

Sorry for the overlong e-mail.

Elias
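To illustrate the control flow discussed above, here is a minimal Python sketch of the two nested reads (labels ‘4’ and ‘14’) with the end-of-file case made explicit. In the snippet, the READ on l. 505 carries END=999 while the READ on l. 515 has no END= specifier; the sketch only shows one way to turn "input ends inside a k-point block" into a clear error. The record layout used here is a simplifying assumption (whitespace-separated fields, header ending in N, NE, WEI), not the real fixed-width format 5001, and the names `read_blocks` and `TruncatedEnergyFile` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


class TruncatedEnergyFile(Exception):
    """Raised when the input ends (or is garbled) inside a k-point block."""


def read_blocks(lines: Iterable[str], nume: int) -> Iterator[Tuple[int, List[float]]]:
    """Yield (ne, energies) per k-point, mimicking loops '4' and '14'.

    Assumed, simplified layout (NOT the real format 5001): a header line
    whose last three whitespace-separated fields are N, NE, WEI, followed
    by NE lines of the form "NUM E1" with NUM counting up from 1.
    """
    it = iter(lines)
    k = 0
    for header in it:                       # label 4: next k-point header
        if not header.strip():
            continue
        k += 1
        fields = header.split()
        try:
            ne = int(fields[-2])
        except (IndexError, ValueError):
            raise TruncatedEnergyFile(f"k-point {k}: unreadable header {header!r}")
        if not 0 < ne <= nume:              # analogue of the check on l. 511
            raise TruncatedEnergyFile(f"k-point {k}: implausible NE={ne}")
        energies: List[float] = []
        while len(energies) < ne:           # label 14: one eigenvalue per line
            line = next(it, None)
            if line is None:                # EOF inside a block: fail, do not spin
                raise TruncatedEnergyFile(
                    f"k-point {k}: expected {ne} eigenvalues, got {len(energies)}"
                )
            parts = line.split()
            if len(parts) != 2:
                raise TruncatedEnergyFile(f"k-point {k}: bad eigenvalue line {line!r}")
            energies.append(float(parts[1]))
        yield ne, energies
```

Unlike the Fortran loop, which leaves loop ‘14’ when NUM equals NEHELP(K,ispin), the sketch counts lines, so a header value that can never be matched (0, or uninitialized data) is rejected up front instead of being waited for.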
Re: [Wien] ‘lapw2 -so’ hangs
I guess I would need to see the calculation myself.

PS: It is not the lack of disk-space ???

On 08.11.2013 07:56, Elias Assmann wrote:
> Dear Peter,
>
> On 11/07/2013 09:50 AM, Peter Blaha wrote:
>> energysoup and energysodn should be the same.
>
> They are, up to a small difference in the header.
>
>> (case.energyup/dn could be larger because in lapw1 you may have a
>> larger E-window (more eigenvalues) than in case.inso)
>
> What puzzled me was that in the case that works normally,
> ‘energysodum’ and ‘energysodn’ are the same size; while in the broken
> case, ‘energysodum’ is the same size again, but ‘energysodn’ is much
> smaller.
>
>> Have you looked into case.outputso and case.inso ?
>
>     $ cat Bi100.inso
>     WFFIL
>     4 0 0        llmax,ipr,kpot
>     -10 1.5      Emin, Emax
>     1 0 0        h,k,l (direction of magnetization)
>     0            number of atoms with RLO
>     0 0          number of atoms without SO, atomnumbers
>
> The only difference to the ‘inso’ for the non-broken case is the
> magnetization direction. The calculation ran fine for a long time
> before this problem appeared, and I did not change the input file.
>
> As for ‘outputso’, like I said, it goes through all 12008 k-vectors.
> But in ‘energysodn’, only 2431 k-points are listed. What else should I
> look for in ‘outputso’?
>
>> Do you get sufficient eigenvalues ?? Maybe E-max in case.inso is
>> wrong??
>
> How many is sufficient? What I can say is that the calculation ran
> fine for a long time before this problem appeared, and I did not
> change the input files.
>
> Elias

--
Peter Blaha
Inst.Materials Chemistry
TU Vienna
Getreidemarkt 9
A-1060 Vienna
Austria
+43-1-5880115671
Re: [Wien] ‘lapw2 -so’ hangs
You do not give the info for case.energysoup

energysoup and energysodn should be the same.
(case.energyup/dn could be larger because in lapw1 you may have a larger E-window (more eigenvalues) than in case.inso)

Have you looked into case.outputso and case.inso ?
Do you get sufficient eigenvalues ?? Maybe E-max in case.inso is wrong??

On 11/07/2013 09:21 AM, Elias Assmann wrote:
> Hi List,
>
> I have two sp+SO calculations which are mostly identical, apart from
> the fact that the magnetization directions are different. Both cases
> have worked fine, but now in one case, ‘lapw2’ does not finish. In the
> ‘output2’ file, RECPR says
>
>     generate new recprlist
>     KXMAX,KYMAX,KZMAX 17 17 15
>     3605 PLANE WAVES GENERATED (INCLUDING FORBIDDEN H,K,L)
>     nwav1,kn 13605
>
> but then the k-vector list only runs from KVEC( 1) to KVEC( 3484).
>
> An ‘strace’ shows that ‘lapw2’ goes on to read the ‘energydum’ and
> ‘energyso’ files (fd 26 is ‘energydum’, 27 is ‘energysodn’; there is
> some seeking in between)
>
>     write(9, Running LAPW2 in single process..., 7926) = 7926
>     write(9,KVEC( 125) =-4 ..., 7980) = 7980
>     …
>     read(27, 199.25000200.20750198.72842 0.2..., 8192) = 8192
>     read(26,199.25000 200.20814 198.7..., 8192) = 8192
>     …
>     read(27, , 8192) = 0
>     lseek(26, 0, SEEK_CUR) = 5105419
>     lseek(26, -7859, SEEK_CUR) = 5097560
>     lseek(26, 0, SEEK_SET) = 0
>     lseek(27, 0, SEEK_CUR) = 5163641
>     lseek(27, 0, SEEK_CUR) = 5163641
>     lseek(27, 0, SEEK_SET) = 0
>     mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32cfd73000
>     mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32cffae000
>     mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32d01e9000
>     mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32d0424000
>     read(27, 199.25000200.20750198.72842 0.2..., 8192) = 8192
>     read(26,199.25000 200.20814 198.7..., 8192) = 8192
>     …
>     read(26, 956434490995 \n 50 ..., 8173) = 8173
>     read(26, 37 1.97297910179952 \n ..., 8184) = 8184
>     read(26, 25 1.96075450056876 \n ..., 8184) = 8184
>     read(26, 13 1.96281528875338 \n ..., 8184) = 3777
>     read(26, , 8192) = 0
>
> EOF. Now it opens a file containing error messages, but note that no
> message is printed.
>
>     open(/opt/intel/Compiler/11.1/046/lib/intel64/locale/en_US/ifcore_msg.cat, O_RDONLY) = 4
>     fstat(4, {st_mode=S_IFREG|0664, st_size=30244, ...}) = 0
>     mmap(NULL, 30244, PROT_READ, MAP_PRIVATE, 4, 0) = 0x2b32d065f000
>     close(4) = 0
>
> Then it gets a SIGSEGV
>
>     --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>     rt_sigreturn(0xb) = 47497226185184
>
> but it does not die (SIGSEGV was trapped earlier). Instead, the last
> two lines are repeated ad infinitum.
>
> This behavior occurs without parallelization, and whether I run with
> ‘-fermi’, ‘-qtl’, or without those flags.
>
> This seems to be caused by an incomplete ‘energysodn’ file. In the
> problematic case:
>
>     $ wc Bi100.energy{dn,sodn,dum}
>       1421538  2903129 53209476 Bi100.energydn
>     -  136400   284967  5163641 Bi100.energysodn
>        673789  1407639 25506805 Bi100.energydum
>
> while in the case with the other magnetization direction:
>
>     $ wc Bi010.energy{dn,sodn,dum}
>       1421538  2903129 53209476 Bi010.energydn
>     -  673786  1407624 25506619 Bi010.energysodn
>        673789  1407639 25506805 Bi010.energydum
>
> But I have no idea why this happens. I have certainly tried “lapw0;
> lapw1; lapwso” several times, even with different ‘clm’s (from various
> saves). The ‘outputso’ files in both cases run up to “K=12008”, and
> print “TOTAL NUMBER OF K-POINTS: 12008” at the end.
>
> Any pointers?
>
> Elias

--
P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300  FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at  WWW: http://info.tuwien.ac.at/theochem/
--
Re: [Wien] ‘lapw2 -so’ hangs
Dear Peter,

On 11/07/2013 09:50 AM, Peter Blaha wrote:
> energysoup and energysodn should be the same.

They are, up to a small difference in the header.

> (case.energyup/dn could be larger because in lapw1 you may have a
> larger E-window (more eigenvalues) than in case.inso)

What puzzled me was that in the case that works normally, ‘energysodum’ and ‘energysodn’ are the same size; while in the broken case, ‘energysodum’ is the same size again, but ‘energysodn’ is much smaller.

> Have you looked into case.outputso and case.inso ?

    $ cat Bi100.inso
    WFFIL
    4 0 0        llmax,ipr,kpot
    -10 1.5      Emin, Emax
    1 0 0        h,k,l (direction of magnetization)
    0            number of atoms with RLO
    0 0          number of atoms without SO, atomnumbers

The only difference to the ‘inso’ for the non-broken case is the magnetization direction. The calculation ran fine for a long time before this problem appeared, and I did not change the input file.

As for ‘outputso’, like I said, it goes through all 12008 k-vectors. But in ‘energysodn’, only 2431 k-points are listed. What else should I look for in ‘outputso’?

> Do you get sufficient eigenvalues ?? Maybe E-max in case.inso is
> wrong??

How many is sufficient? What I can say is that the calculation ran fine for a long time before this problem appeared, and I did not change the input files.

Elias
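A quick way to check how many k-points an energy file actually lists is to count the k-point blocks in it. The following is a rough Python sketch under a stated assumption: every k-point block consists of eigenvalue lines of the form "NUM E1" (as read on l. 515 of ‘fermi_tetra’) and starts with NUM = 1, and no other line in the file consists of exactly two fields with "1" first. The function names are hypothetical, and the heuristic has not been checked against the actual file format.

```python
from typing import Dict, Iterable


def count_kpoint_blocks(path: str) -> int:
    """Count k-point blocks by counting 'NUM E1' lines with NUM == 1.

    Heuristic only: assumes each block's eigenvalue list starts at band
    index 1 and that header lines never look like "1 <number>".
    """
    count = 0
    with open(path, "r", errors="replace") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "1":
                count += 1
    return count


def compare(paths: Iterable[str]) -> Dict[str, int]:
    """Return {path: block count} so that a short file stands out."""
    return {p: count_kpoint_blocks(p) for p in paths}


if __name__ == "__main__":
    import sys

    # Example: python count_k.py Bi100.energysodn Bi100.energydum
    for name, n in compare(sys.argv[1:]).items():
        print(f"{n:8d}  {name}")
```

If the counts for ‘energysodn’ and ‘energydum’ differ, the shorter file was most likely cut off while it was being written, which is cheaper to find out this way than by tracing ‘lapw2’.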