Re: [Wien] ‘lapw2 -so’ hangs

2013-11-12 Thread Elias Assmann

Hi,

My original problem disappeared after another “lapw0; lapw1; lapwso”
cycle, only this time I ran ‘clean’ first.  I guess something had been
left in an inconsistent state by previous calculations in that
directory, and ‘clean’ removed the offending file.



Elias

___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] ‘lapw2 -so’ hangs

2013-11-11 Thread Elias Assmann

Dear Peter,

I have tried to narrow things down a bit.  The subroutine
‘fermi_tetra’ gets stuck in the loop labeled ‘14’.  Here is a
code snippet:

   498   4 K=K+1
   499     if(iloop0.ne.0) KPP(ILOOP0)=K
   500  !para begin
   501  ! testing
   502  !  write(*,*)'reading k=',K,itap,ispin,iloop,iloop0
   503  !para end
   504     IF(K.GT.2*NKPT) GOTO 900
   505     READ(ITAP,5001,END=999) SS,T,ZZ,KNAME,N,NEHELP(K,ispin),WEI
   506     nmat=MAX(n,nmat)
   507
   508  !para begin
   509     NE(K+k1)=NEHELP(K,ispin)
   510  !para end
   511     if(nehelp(k,ispin).gt.nume) GOTO 920
   512     if(nemax.lt.nehelp(k,ispin)) nemax=nehelp(k,ispin)
   513     IF(N.GT.MAXWAV) MAXWAV=N
   514     IF(N.LT.MINWAV) MINWAV=N
   515  14 READ(ITAP,*) NUM,E1
   516     Eb(num,K,ispin)=E1
   517     if(itap.eq.30.and.(e1.gt.ebmax(num))) ebmax(num)=e1
   518     if(itap.eq.30.and.e1.lt.ebmin(num)) ebmin(num)=e1
   519  !  READ(ITAP) (A(I),I=1,N)
   520     IF(NUM.EQ.NEHELP(K,ispin)) GOTO 4
   521     GOTO 14

I put a debug statement

  write(0,*) 'Hello ', k,ispin, nehelp(k,ispin), nume

before l. 511.  The last few lines of output look either like this:

 Hello 2434   2  54  60
 Hello 2435   2  54  60
 Hello 2436   2  56  60
 Hello 2437   2   0  60
 Hello 2438   2   0  60
 Hello 2439   2  1198992928  60
FERMI - Error

where an error is raised on l. 511, or like this:

 Hello 2434   2  54  60
 Hello 2435   2  54  60
 Hello 2436   2  56  60
 Hello 2437   2   0  60
 Hello 2438   2   0  60
 Hello 2439   2  -820289632  60

where the program goes into the infinite loop instead.

What happens is that the NEHELP array is too small, so the READ on
l. 505 fails and NEHELP(K,ispin) ends up containing uninitialized
data.  So I guess the problem stems from the ‘energysodn’ which is too
small, and I need to go look at what is going wrong in lapwso.
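
The two outcomes above can be illustrated with a small Python model of
the read loop.  This is only a sketch of the control flow, not WIEN2k
code: the record layout (tuples instead of formatted lines) and the
name `read_kpoint_blocks` are assumptions made for the illustration.
It shows checks that would turn both a garbage band count and a
premature end of file into explicit errors.

```python
def read_kpoint_blocks(records, nume):
    """Model of the k-point/band read loop with explicit sanity checks.

    `records` is a list of tuples (hypothetical layout): a k-point header
    is (name, n_bands), each following band record is (band_index, energy).
    """
    it = iter(records)
    bands = {}
    for name, ne in it:                      # header record, cf. l. 505
        # cf. l. 511, but zero and negative counts are rejected as well
        if not 0 < ne <= nume:
            raise ValueError(f"k-point {name}: implausible band count {ne}")
        energies = []
        while len(energies) < ne:            # cf. the loop labeled 14
            record = next(it, None)
            if record is None:               # end of file inside a block
                raise EOFError(
                    f"k-point {name}: file ends after "
                    f"{len(energies)} of {ne} bands"
                )
            _num, e1 = record
            energies.append(e1)
        bands[name] = energies
    return bands
```

With a band count such as -820289632 the test on l. 511 passes, and
NUM.EQ.NEHELP(K,ispin) on l. 520 presumably never becomes true, which
would be consistent with the endless loop; the model rejects such a
value up front.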

But I thought I should share this anyway.  In particular, I do not
understand what is going on with the SIGSEGVs the program gets.  They
would be caused by NEHELP being too small, but why doesn't the program
die?  The Wien2k signal handling is not invoked (since this is not
parallel); I do see a call

  rt_sigaction(SIGSEGV, {0x4d2480, [], SA_RESTORER|SA_RESTART|SA_NODEFER|SA_SIGINFO, 0x2b472c1f5ca0}, NULL, 8) = 0

in the trace, but ifort seems to do this even for the simplest test
program, and that does not prevent it from dying on a SIGSEGV.

Secondly, I thought the READ on l. 515 would raise an error on EOF;
instead it seems to “busy wait” (it never returns but keeps the CPU
usage at 100%).

What is more, I found that the problem interacts in a subtle way with
ifort's (V. 11.1) ‘-ipo’ and ‘-g’ switches.  Originally, I had ‘-ipo’
set, resulting in the infinite loop.  For debugging, I took that out
and added ‘-g’ instead, which resulted in the behavior described
above.

Turning ‘-ipo’ back on, the debug output looks like this:

 Hello 2434   2  54  60
 Hello 2435   2  54  60
 Hello 2436   2  56  60
 Hello 2437   2  56  60

and the infinite loop always happens.

When I use neither switch, the result is what I would normally expect:
the program dies from the segfault.

Summarizing, this is what I see:

  -g -ipo   : silent fail
  -ipo      : silent fail
  -g        : “FERMI - Error” / silent fail
  (neither) : “normal” segfault


Sorry for the overlong e-mail.

Elias


Re: [Wien] ‘lapw2 -so’ hangs

2013-11-08 Thread Peter Blaha

I guess I would need to see the calculation myself.

PS: It is not a lack of disk space, is it?

On 08.11.2013 07:56, Elias Assmann wrote:

Dear Peter,

On 11/07/2013 09:50 AM, Peter Blaha wrote:

energysoup and energysodn should be the same.


They are, up to a small difference in the header.


(case.energyup/dn could be larger because in lapw1 you may have a larger
E-window (more eigenvalues) than in case.inso)


What puzzled me was that in the case that works normally, `energysodum´
and `energysodn´ are the same size; while in the broken case,
`energysodum´ is the same size again, but `energysodn´ is much smaller.


Have you looked into case.outputso and case.inso ?


$ cat Bi100.inso
WFFIL
4  0  0                 llmax,ipr,kpot
-10  1.5                Emin, Emax
1 0 0                   h,k,l (direction of magnetization)
0                       number of atoms with RLO
0 0                     number of atoms without SO, atomnumbers

The only difference to the `inso´ for the non-broken case is the
magnetization direction.  The calculation ran fine for a long time
before this problem appeared, and I did not change the input file.

As for `outputso´, like I said, it goes through all 12008 k-vectors. But
in `energysodn´, only 2431 k-points are listed.  What else should I look
for in `outputso´?


Do you get sufficient eigenvalues ?? Maybe E-max in case.inso is wrong??


How many is sufficient?  What I can say is that the calculation ran fine
for a long time before this problem appeared, and I did not change the
input files.


 Elias


--
Peter Blaha
Inst.Materials Chemistry
TU Vienna
Getreidemarkt 9
A-1060 Vienna
Austria
+43-1-5880115671


Re: [Wien] ‘lapw2 -so’ hangs

2013-11-07 Thread Peter Blaha

You do not give the info for case.energysoup

energysoup and energysodn should be the same.
(case.energyup/dn could be larger because in lapw1 you may have a larger
E-window (more eigenvalues) than in case.inso)


Have you looked into case.outputso and case.inso ?

Do you get sufficient eigenvalues ?? Maybe E-max in case.inso is wrong??

On 11/07/2013 09:21 AM, Elias Assmann wrote:

Hi List,

I have two sp+SO calculations which are mostly identical, apart from
the fact that the magnetization directions are different.  Both cases
have worked fine, but now in one case, ‘lapw2’ does not finish.

In the ‘output2’ file, RECPR says

  generate new recprlist
   KXMAX,KYMAX,KZMAX  17  17  15
  3605 PLANE WAVES GENERATED (INCLUDING FORBIDDEN H,K,L)

  nwav1,kn   13605

but then the k-vector list only runs from KVEC( 1) to KVEC( 3484).

An ‘strace’ shows that ‘lapw2’ goes on to read the ‘energydum’ and
‘energyso’ files (fd 26 is ‘energydum’, 27 is ‘energysodn’; there is
some seeking in between)

   write(9, " Running LAPW2 in single process"..., 7926) = 7926
   write(9, "KVEC(   125) =-4 "..., 7980) = 7980
   …
   read(27, "199.25000200.20750198.72842  0.2"..., 8192) = 8192
   read(26, "199.25000   200.20814   198.7"..., 8192) = 8192
   …
   read(27, "", 8192)  = 0
   lseek(26, 0, SEEK_CUR)  = 5105419
   lseek(26, -7859, SEEK_CUR)  = 5097560
   lseek(26, 0, SEEK_SET)  = 0
   lseek(27, 0, SEEK_CUR)  = 5163641
   lseek(27, 0, SEEK_CUR)  = 5163641
   lseek(27, 0, SEEK_SET)  = 0
   mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32cfd73000
   mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32cffae000
   mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32d01e9000
   mmap(NULL, 2338816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b32d0424000
   read(27, "199.25000200.20750198.72842  0.2"..., 8192) = 8192
   read(26, "199.25000   200.20814   198.7"..., 8192) = 8192
   …
   read(26, "956434490995 \n  50  "..., 8173) = 8173
   read(26, "  37   1.97297910179952 \n   "..., 8184) = 8184
   read(26, "  25   1.96075450056876 \n   "..., 8184) = 8184
   read(26, "  13   1.96281528875338 \n   "..., 8184) = 3777
   read(26, "", 8192)  = 0

EOF.  Now it opens a file containing error messages, but note that no
message is printed.


   open("/opt/intel/Compiler/11.1/046/lib/intel64/locale/en_US/ifcore_msg.cat", O_RDONLY) = 4
   fstat(4, {st_mode=S_IFREG|0664, st_size=30244, ...}) = 0
   mmap(NULL, 30244, PROT_READ, MAP_PRIVATE, 4, 0) = 0x2b32d065f000
   close(4) = 0

Then it gets a SIGSEGV

   --- SIGSEGV (Segmentation fault) @ 0 (0) ---
   rt_sigreturn(0xb)   = 47497226185184

but it does not die (SIGSEGV was trapped earlier).  Instead, the last
two lines are repeated ad infinitum.  This behavior occurs without
parallelization, and whether I run with ‘-fermi’, with ‘-qtl’, or
without those flags.

This seems to be caused by an incomplete ‘energysodn’ file.  In the
problematic case:

$ wc Bi100.energy{dn,sodn,dum}
1421538  2903129 53209476 Bi100.energydn
-  136400   284967  5163641 Bi100.energysodn
 673789  1407639 25506805 Bi100.energydum

while in the case with the other magnetization direction:

$ wc Bi010.energy{dn,sodn,dum}
 1421538   2903129  53209476 Bi010.energydn
-   673786   1407624  25506619 Bi010.energysodn
  673789   1407639  25506805 Bi010.energydum
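
The comparison of line counts above can be automated.  The following
Python sketch flags a file as suspicious when it is noticeably shorter
than a reference file; the function name `looks_truncated` and the 1%
tolerance are arbitrary assumptions, and using ‘energydum’ as the
reference rests only on the observation that the two files have nearly
the same length in the working case.

```python
def looks_truncated(n_lines, n_lines_reference, tolerance=0.01):
    """Return True if a file is shorter than its reference file by more
    than the relative `tolerance` (an arbitrary, adjustable threshold)."""
    return n_lines < (1.0 - tolerance) * n_lines_reference


# Line counts taken from the `wc` output quoted in this message.
print(looks_truncated(136400, 673789))   # Bi100.energysodn vs. energydum
print(looks_truncated(673786, 673789))   # Bi010.energysodn vs. energydum
```

This only detects a gross size mismatch; it says nothing about why the
file is short.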

But I have no idea why this happens.  I have certainly tried “lapw0;
lapw1; lapwso” several times, even with different ‘clm’s (from various
saves).

The ‘outputso’ files in both cases run up to “K=12008”, and print
“TOTAL NUMBER OF K-POINTS:   12008” at the end.

Any pointers?


 Elias


--

  P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at    WWW: http://info.tuwien.ac.at/theochem/

--


Re: [Wien] ‘lapw2 -so’ hangs

2013-11-07 Thread Elias Assmann

Dear Peter,

On 11/07/2013 09:50 AM, Peter Blaha wrote:

energysoup and energysodn should be the same.


They are, up to a small difference in the header.


(case.energyup/dn could be larger because in lapw1 you may have a larger
E-window (more eigenvalues) than in case.inso)


What puzzled me was that in the case that works normally, `energysodum´ 
and `energysodn´ are the same size; while in the broken case, 
`energysodum´ is the same size again, but `energysodn´ is much smaller.



Have you looked into case.outputso and case.inso ?


$ cat Bi100.inso
WFFIL
4  0  0                 llmax,ipr,kpot
-10  1.5                Emin, Emax
1 0 0                   h,k,l (direction of magnetization)
0                       number of atoms with RLO
0 0                     number of atoms without SO, atomnumbers

The only difference to the `inso´ for the non-broken case is the 
magnetization direction.  The calculation ran fine for a long time 
before this problem appeared, and I did not change the input file.


As for `outputso´, like I said, it goes through all 12008 k-vectors. 
But in `energysodn´, only 2431 k-points are listed.  What else should I 
look for in `outputso´?
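
One way to check how many k-points an energy file really contains is to
count its k-point header lines.  The Python sketch below does this
under assumptions that are not taken from the WIEN2k documentation:
the first `n_header` lines hold the linearization energies (the caller
has to supply that number for the case at hand), every band record
consists of exactly two fields (band index and energy), and any other
non-blank line starts a new k-point.  The name `count_kpoints` is made
up for the illustration.

```python
def count_kpoints(lines, n_header):
    """Count k-point header lines in the text of an energy file.

    Assumed layout (hypothetical, see the text above): `n_header` leading
    lines, then per k-point one header line followed by band records of
    the form "index energy".
    """
    def is_band_record(line):
        fields = line.split()
        if len(fields) != 2:
            return False
        try:
            int(fields[0])
            float(fields[1])
        except ValueError:
            return False
        return True

    return sum(
        1
        for line in lines[n_header:]
        if line.strip() and not is_band_record(line)
    )
```

For a real file one would read the lines first, for example with
`open("Bi100.energysodn").readlines()`, and compare the result with the
number of k-points reported at the end of `outputso´.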



Do you get sufficient eigenvalues ?? Maybe E-max in case.inso is wrong??


How many is sufficient?  What I can say is that the calculation ran fine 
for a long time before this problem appeared, and I did not change the 
input files.



Elias