Re: [Wien] System configuration

Pavel Ondračka Thu, 23 May 2019 11:34:05 -0700

Hi Indranil,




I'm sending this again this time also to the list (haven't noticed you
removed it), in the hope it might be useful for someone optimizing with 
gfortran as well...




Pavel




"Well,

first we need to figure out why is your serial lapw so slow...
You definitely don't have the libmvec patches, however almost two min
runtime suggest that even your BLAS might be bad?

In the test_case folder run:
$ grep "TIME HAMILT" test_case.output1
and post the output. Also please go to the Wien2k folder and send the
output of
$ cat WIEN2k_OPTION
and
$ ldd lapw1

Next Wien2k version will have this simplified, however for now some
patching needs to be to be done. The other option would be to get MKL
and ifort from Intel and use it instead...

Anyway if you don't want MKL, you need to download the attached patch
to the SRC_lapw1 folder in Wien2k base folder.
Go to the folder, and apply the patch with (you might need the patch
package for that)
$ patch -p1 < lapw1.patch
then set the FOPT compile flags via siteconfig to:
-ffree-form -O2 -ffree-line-length-none -march=native -ftree-vectorize 
-DHAVE_LIBMVEC -fopenmp
and recompile lapw1.
Now when you do again
$ ldd lapw1
it should show line with "libmvec.so.1 => /lib64/libmvec.so.1"

Compare timings again with the test_case.
Also try:
$ OMP_NUM_THREADS=2 x lapw1
$ OMP_NUM_THREADS=4 x lapw1

And after each run show total timings as well as
$ grep "TIME HAMILT" test_case.output1
Hopefully, you are already linking the multithreaded Openblas (but
dunno what is the Ubuntu default)...

I'll help you with the parallel execution in the next step.

Best regards
Pavel

On Thu, 2019-05-23 at 18:58 +0530, Indranil mal wrote:
> Dear sir
>
> After running x lapw1 I got the following
>
> ~/test_case$ x lapw1
> STOP LAPW1 END
> 114.577u 0.247s 1:54.82 99.9% 0+0k 0+51864io 0pf+0w
>
> I am using parallel k point execution only 8 GB memory is in use and
> for 100 atom (100 kpoints) calculation it is taking around 12 hours
> to complete one cycle.
> please help me.
>
> Thanking you
>
> Indranil
>
> On Thu, May 23, 2019 at 11:22 AM Pavel Ondračka <
> pavel.ondra...@email.cz> wrote:
> > Hi Indranil,
> >
> > While the k-point parallelization is usually the most efficient
> > (provided you have sufficient number of k-points) and does not need 
> > any
> > extra libraries, for 100atoms case it might be problematic to fit
> > 12
> > processes into 32GB of memory. I assume you are already using it
> > since
> > you claim to run on two cores?
> >
> > Instead check what is the maximum memory requirement of lapw1 when
> > run
> > in serial and based on that find how much processes you can run in
> > parallel, than for each place one line "1:localhost" into .machines 
> > file (there is no need to copy .machines from templates, or use
> > random
> > scripts, instead read the userguide to understand what you are
> > doing,
> > it will save you time in the long run). If you can run at least few 
> > k-
> > points in parallel it might be enough to speed it up significantly. 
> >
> > For MPI you would need openmpi-devel scalapack-devel and fftw3-
> > devel
> > (I'm not sure how exactly are they named on Ubuntu) packages.
> > Especially the scalapack configuration could be tricky, it is
> > probably
> > easiest to start with lapw0 as this needs only MPI and fftw.
> >
> > Also based on my experience with default gfortran settings, it is
> > likely that you don't have even optimized the single core
> > performance,
> > try to download the serial benchmark
> > http://susi.theochem.tuwien.ac.at/reg_user/benchmark/test_case.tar.gz 
> > untar, run x lapw1 and report timings (on average i7 CPU it should
> > take
> > below 30 seconds, if it takes significantly more, you will need
> > some
> > more tweaks).
> >
> > Best regards
> > Pavel
> >
> > On Thu, 2019-05-23 at 10:42 +0530, Dr. K. C. Bhamu wrote:
> > > Hii,
> > >
> > > If you are doing k-point parallel calculation (having number of
> > k-
> > > points in IBZ more then 12) then use below script on terminal
> > where
> > > you want to run the calculation or use in your job script with
> > -p
> > > option in run(sp)_lapw (-so).
> > >
> > > if anyone knows how to repeat a nth line m times in a file then
> > this
> > > script can be changed.
> > >
> > > Below script simply coping machine file from temple directory and 
> > > updating it as per your need.
> > > So you do not need copy it, open it in your favorite editor and
> > do it
> > > manually.
> > >
> > > cp $WIENROOT/SRC_templates/.machines . ; grep localhost .machines 
> > |
> > > perl -ne 'print $_ x 6' > LOCALHOST.dat ; tail -n 2 .machines >
> > > grang.dat ; sed '22,25d' .machines > MACHINE.dat ; cat
> > MACHINE.dat
> > > localhost.dat grang.dat > .machines ; rm LOCALHOST.dat
> > MACHINE.dat
> > > grang.dat
> > >
> > > regards
> > > Bhamu
> > >
> > >
> > > On Wed, May 22, 2019 at 10:52 PM Indranil mal <
> > indranil....@gmail.com
> > > > wrote:
> > > > respected sir/ Users,
> > > > I am using a PC with intel i7 8th gen (with
> > 12
> > > > cores) 32GB RAM and 2TB HDD with UBUNTU 18.04 LTS. I have
> > installed
> > > > OpenBLAS-0.2.20 and using GNU FORTRAN and c compiler. I am
> > trying
> > > > to run a system with 100 atoms only two cores are using the
> > rest of
> > > > them are idle and the calculation taking a too long time. I
> > have
> > > > not installed mpi ScaLAPACK or elpa. Please help me what should 
> > I
> > > > do to utilize all of the cores of my cpu.
> > > >
> > > >
> > > >
> > > > Thanking you
> > > >
> > > > Indranil
> > > > _______________________________________________
> > > > Wien mailing list
> > > > Wien@zeus.theochem.tuwien.ac.at
> > > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > > > SEARCH the MAILING-LIST at:
> > > >
> > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> > >
> > > _______________________________________________
> > > Wien mailing list
> > > Wien@zeus.theochem.tuwien.ac.at
> > > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > > SEARCH the MAILING-LIST at:
> > >
> > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
> >
> > _______________________________________________
> > Wien mailing list
> > Wien@zeus.theochem.tuwien.ac.at
> > http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> > SEARCH the MAILING-LIST at:
> > http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
"

commit 83014e72861208c354cb470a41cb41e296d3aa29
Author: Pavel Ondračka <pavel.ondra...@gmail.com>
Date:   Wed Oct 31 10:35:19 2018 +0100

    lapw1: OMP

diff --git a/Makefile b/Makefile
index 0f61796..a8340f7 100644
--- a/Makefile
+++ b/Makefile
@@ -104,7 +104,7 @@ OBJS =	\
 	ph.o prtkpt.o prtres.o pzheevx16.o rdswar.o \
 	rint13.o rotate.o rotdef.o seclit.o seclr4.o seclr5.o \
 	select.o service.o setkpt.o setwar.o sphbes.o stern.o SymmRot.o \
-	tapewf.o t3j.o t3j0.o ustphx.o vectf.o warpin.o wfpnt.o wfpnt1.o \
+	tapewf.o t3j.o t3j0.o ustphx.o warpin.o wfpnt.o wfpnt1.o \
 	ylm.o zhcgst.o zheevx2.o zher2m.o \
 	jacdavblock.o make_albl.o global2local.o \
 	par_syrk.o my_dsygst.o refblas_dtrsm.o seclit_par.o \
diff --git a/Makefile.orig b/Makefile.orig
index 1a18233..89b8329 100644
--- a/Makefile.orig
+++ b/Makefile.orig
@@ -104,7 +104,7 @@ OBJS =	\
 	ph.o prtkpt.o prtres.o pzheevx16.o rdswar.o \
 	rint13.o rotate.o rotdef.o seclit.o seclr4.o seclr5.o \
 	select.o service.o setkpt.o setwar.o sphbes.o stern.o SymmRot.o \
-	tapewf.o t3j.o t3j0.o ustphx.o vectf.o warpin.o wfpnt.o wfpnt1.o \
+	tapewf.o t3j.o t3j0.o ustphx.o warpin.o wfpnt.o wfpnt1.o \
 	ylm.o zhcgst.o zheevx2.o zher2m.o \
 	jacdavblock.o make_albl.o global2local.o \
 	par_syrk.o my_dsygst.o refblas_dtrsm.o seclit_par.o \
diff --git a/hamilt.F b/hamilt.F
index e84b259..23351ab 100644
--- a/hamilt.F
+++ b/hamilt.F
@@ -22,6 +22,7 @@
       use struk, only : NDF,multmax
       use out, only   : WARP
       use albl
+      use stdmath
 
 !      use out, only   : WARP1, WARP2, WARP3
       IMPLICIT NONE
@@ -44,15 +45,16 @@
 !
 !        Arguments
 !
-      INTEGER            INS, LTOP, NAT, NV
+      INTEGER            INS, LTOP
+      INTEGER, INTENT(IN) :: NV, NAT
       INTEGER            MULT(NAT)
       DOUBLE PRECISION   VOLUME
-      DOUBLE PRECISION   R(NAT)
+      DOUBLE PRECISION, INTENT(IN) :: R(NAT)
 !      DOUBLE PRECISION   XK(HSROWS+1), YK(HSROWS+1), ZK(HSROWS+1)
       DOUBLE PRECISION   CHN(0:LMAX-1,NAT), DU(0:LMAX-1,NAT)
       DOUBLE PRECISION   DUP(0:LMAX-1,NAT), E(0:LMAX-1,NAT)
-      DOUBLE PRECISION   POS(3,NDF)
-      DOUBLE PRECISION   U(0:LMAX-1,NAT), UP(0:LMAX-1,NAT), V(NAT)
+      DOUBLE PRECISION, INTENT(IN) :: POS(3,NDF), V(NAT)
+      DOUBLE PRECISION   U(0:LMAX-1,NAT), UP(0:LMAX-1,NAT)
 
 !        YL(lm,:,i) - temporary storage of spherical harmonics for
 !                     local orbitals
@@ -77,7 +79,7 @@
       DOUBLE PRECISION   C6R, C6I, C7R, C7I, C8R, C8I
       DOUBLE PRECISION   CABR, CABI, CABCR, CABCI, CLR, CLI
       DOUBLE PRECISION   CLYLR, CLYLI, CLYLNR, CLYLNI
-      DOUBLE PRECISION   DLL1, PI4R2V
+      DOUBLE PRECISION   DLL1, PI4R2V, DEN
       DOUBLE PRECISION   DTIME1, DTIME2, DTIMH, DTIMLG, DTIMMA, DTIMPH,DTIMDIST
       DOUBLE PRECISION   DTIME3, dtime240, dtime241, dtime242
       DOUBLE PRECISION   dtime260, dtime211, dtime231, dtime210,dtime230,time_albl_tot
@@ -85,20 +87,24 @@
       DOUBLE PRECISION   DTIMS, DTIMUS, PI, Q, RKN, TMP,  VIFPR4, rk_tmp1
 
       DOUBLE PRECISION, allocatable :: C3R(:),C13R(:),C3RUP(:), C3IUP(:),C13RUP(:),C13IUP(:)   
-      DOUBLE PRECISION, allocatable ::   DEN(:), DFJ(:), FJ(:), FK(:), rk_tmp(:)
-      DOUBLE PRECISION, allocatable ::   PPLX(:), PPLY(:), ROTV1(:)
-      DOUBLE PRECISION, allocatable ::   ROTV2(:), SLEN(:,:), TMP1(:)
-      DOUBLE PRECISION, allocatable ::   TMP2(:), VEC(:)
+      DOUBLE PRECISION, allocatable ::   rk_tmp(:)
+      DOUBLE PRECISION, allocatable ::   PPLX(:), PPLY(:)
+      DOUBLE PRECISION, allocatable ::   SLEN(:,:), TMP1(:), TMP2(:)
+      DOUBLE PRECISION, dimension(3) :: FK, ROTV1, ROTV2, VEC
       DOUBLE PRECISION, allocatable ::   X2(:,:), XL(:),x(:)
       DOUBLE PRECISION, allocatable ::   P(:,:,:)
-      DOUBLE PRECISION, allocatable ::   help1(:), help_cos(:)
       DOUBLE PRECISION              ::   xk_tmp_c,yk_tmp_c,zk_tmp_c
       DOUBLE PRECISION, allocatable ::   xk_tmp_r(:),yk_tmp_r(:),zk_tmp_r(:)
+#if defined (INTEL_VML)
 !_COMPLEX      DOUBLE COMPLEX, allocatable ::     help_exp(:)
+!_REAL      DOUBLE PRECISION ::     help_exp
+      DOUBLE PRECISION, allocatable ::   help1(:), help_cos(:)
       DOUBLE PRECISION, allocatable ::   help_sin(:), tmp_y(:)
-#if defined (INTEL_VML)
 !!_COMPLEX      DOUBLE PRECISION, allocatable ::   help_tmpsin(:), help_tmpcos(:)
       DOUBLE PRECISION, allocatable ::   tmp_ytmp(:)
+#else
+      DOUBLE PRECISION              :: help_sin, help1, help_cos, tmp_y, tmp_ytmp
+      DOUBLE COMPLEX                :: help_exp
 #endif
 !_REAL      DOUBLE PRECISION ::   PHS
 !_REAL      DOUBLE PRECISION, allocatable ::   PHASE(:),spanelus(:,:)
@@ -133,11 +139,6 @@
 
       CALL BARRIER
 
-!
-      nloat=nloat_new
-      allocate(C3R(nloat),C13R(nloat),C3RUP(nloat),C3IUP(nloat),C13RUP(nloat),C13IUP(nloat))   
-      allocate( DFJ(0:LMAX-1), FJ(0:LMAX-1), FK(3) )
-      allocate( ROTV1(3),ROTV2(3),VEC(3)  )
       allocate( XL(LMAX-2) )
 
 #ifdef Parallel
@@ -148,30 +149,10 @@
       HSdim_r=HSROWS
 #endif
 
-      allocate( HPANEL( HSdim_r, BLOCKSIZE) )
-      allocate( SPANEL( HSdim_r, BLOCKSIZE) )
-      allocate( DEN(HSdim_r), rk_tmp(HSdim_r) )
-      allocate( PPLX(HSdim_r), PPLY(HSdim_r) )
-      allocate( SLEN(HSdim_r,blocksize), TMP1(HSdim_r) )
-      allocate( TMP2(HSdim_r) )
+      allocate( rk_tmp(HSdim_r) )
       allocate( kzz_TMP(3,HSdim_r) )
-      allocate( kzz_ihelp(3,blocksize) )
-      allocate( X2(HSdim_r,blocksize), X(HSdim_r) )
-      allocate( P(HSdim_r,0:LTOP-1,blocksize) )
-      allocate( help1(HSdim_r), help_cos(HSdim_r), help_sin(HSdim_r) )
-#if defined (INTEL_VML)
-!!_COMPLEX      allocate( help_tmpcos(HSdim_r), help_tmpsin(HSdim_r) )
-      allocate( tmp_ytmp(HSdim_r) )
-#endif
-      allocate( tmp_y(HSdim_r) )
-!_COMPLEX      allocate( help_exp(HSdim_r) )
-      allocate( PHASE(HSdim_r) )
-      allocate(spanelus(HSdim_r,blocksize))
-      allocate(j_end(blocksize),j_end_1(blocksize))
       allocate(xk_tmp_r(HSdim_r),yk_tmp_r(HSdim_r),zk_tmp_r(HSdim_r))
 
-      CALL init_albl(ltop-1,'HAM')
-
 !_REAL     write(6,'(a20,f12.1,a,a,2i6)') 'allocate spanel',HSdim_r*blocksize*8.0/(1024.0*1024.0),' MB ',&
 !_REAL           '         dimensions',HSdim_r,blocksize
 !_REAL     write(6,'(a20,f12.1,a,a,2i6)') 'allocate hpanel',HSdim_r*blocksize*8.0/(1024.0*1024.0),' MB ',&
@@ -236,8 +217,64 @@
          rk_tmp(j)=RK(J_g)
       enddo
 
+!$omp parallel private(atmp,atmp1,btmp,btmp1,dll1,j,tmp3,stmp,htmp, &
+!$omp& tmp1,tmp2,ihelp, phase, ins, &
+!$omp& jeq, indatm,argxx,argx, help1, help_cos, help_sin, help_exp, &
+!$omp& v3,tmp_y,tmp_ytmp,besr,ipr,i,j_g,fk, ipc, imin, imax, &
+!$omp& j_end_1,j_end, kzz_ihelp,xk_tmp_c,yk_tmp_c,zk_tmp_c, &
+!$omp& rk_tmp1,x2,slen,x,den,pplx,pply,p,k,spanel,spanelus,hpanel, &
+!$omp& jneq,INDATM0,vifpr4) &
+!$omp& reduction(max:DTIMPH,DTIMS,DTIMH,DTIMUS,DTIMMA,time_albl_tot,DTIMLG, &
+!$omp& DTIMPHw,DTIMUSw,DTIMMAw,time_albl_totw,DTIMLGw,DTIMDIST,DTIMDISTw)
+
+      allocate( HPANEL( HSdim_r, BLOCKSIZE) )
+      allocate( SPANEL( HSdim_r, BLOCKSIZE) )
+      allocate( kzz_ihelp(3,blocksize) )
+      allocate( j_end(blocksize),j_end_1(blocksize) )
+      allocate( PHASE(HSdim_r) )
+      allocate( spanelus(HSdim_r,blocksize))
+      allocate( SLEN(HSdim_r,blocksize), TMP1(HSdim_r) )
+      allocate( TMP2(HSdim_r) )
+      allocate( X2(HSdim_r,blocksize), X(HSdim_r) )
+      allocate( PPLX(HSdim_r), PPLY(HSdim_r) )
+      allocate( P(HSdim_r,0:LTOP-1,blocksize) )
+
+#if defined (INTEL_VML)
+      allocate( help1(HSdim_r), help_cos(HSdim_r), help_sin(HSdim_r) )
+!!_COMPLEX      allocate( help_tmpcos(HSdim_r), help_tmpsin(HSdim_r) )
+      allocate( tmp_ytmp(HSdim_r) )
+      allocate( tmp_y(HSdim_r) )
+!_COMPLEX      allocate( help_exp(HSdim_r) )
+#endif
+
+      CALL init_albl(ltop-1,'HAM')
+
+!$omp do schedule(dynamic,1)
       iouter_loop: DO IOUTER = 0, NV/(BLOCKSIZE)
 
+! To properly count the CPU and wallclock time with OpenMP we need to use
+! the max reduction, however we need the values to be initialized to 0 instead of -inf.
+! Proper solution is to bake our own with omp declare reduction, but this is currently
+! impossible due to missing ifort support (2018).
+#if defined (_OPENMP)
+      if (DTIMPH .lt. -HUGE(DTIMPH)) then
+         DTIMPH = ZERO
+         DTIMUS = ZERO
+         DTIMLG = ZERO
+         DTIMMA = ZERO
+         DTIMS = ZERO
+         DTIMH = ZERO
+         DTIMDIST=ZERO
+         time_albl_tot=zero
+         DTIMPHw = ZERO
+         DTIMUSw = ZERO
+         DTIMLGw = ZERO
+         DTIMMAw = ZERO
+         DTIMDISTw=ZERO
+         time_albl_totw=zero
+      endif
+#endif
+
 #ifdef Parallel
          ipc=mod(iouter,npcol)
          if (mycolhs.ne.ipc) cycle
@@ -284,11 +321,9 @@
                                     (YK_tmp_r(J)-YK_tmp_c)*(YK_tmp_r(J)-YK_tmp_c)+ &
                                     (ZK_tmp_r(J)-ZK_tmp_c)*(ZK_tmp_r(J)-ZK_tmp_c))
                x(j)=X2(J,ihelp)
-               den(j)=rk_tmp1*rk_tmp(j)
-            enddo
+               den=rk_tmp1*rk_tmp(j)
 
-            DO J = 1,j_end(ihelp)
-               IF (DEN(J) .GT. TOL) X(J) = X(J)/DEN(J)               
+               IF (DEN .GT. TOL) X(J) = X(J)/DEN
                PPLX(J) = X(J)*X(J)
                PPLY(J) = PPLX(J) - ONE
                P(J,0,ihelp) = ONE
@@ -374,10 +409,14 @@
                   ARGXX = ARGXX + POS(2,INDATM)*KZZ_ihelp(2,ihelp)
                   ARGXX = ARGXX + POS(3,INDATM)*KZZ_ihelp(3,ihelp)
 
+#if !defined (INTEL_VML)
+!$omp simd private(argx,help1,help_cos,help_exp)
+#endif
                   DO J = 1,j_end(ihelp)
                      ARGX = POS(1,INDATM)*KZZ_tmp(1,J) - ARGXX
                      ARGX = POS(2,INDATM)*KZZ_tmp(2,J) + ARGX
                      ARGX = POS(3,INDATM)*KZZ_tmp(3,J) + ARGX
+#if defined (INTEL_VML)
                      help1(j)=TWO*PI*(ARGX)
                   enddo
 
@@ -385,7 +424,6 @@
 !MASS-Lib help_cos(j)-cos(help1(j)), j=1,i
 ! or vmllib from Intel
 !
-#if defined (INTEL_VML)
 !_REAL            call vdcos(j_end(ihelp),help1,help_cos)
 ! in case of older mk/vml you may miss vzcis (IPO Error: unresolved : vzcis_)
 ! Either upgrade the libraries (ifort+mkl) or:
@@ -398,12 +436,18 @@
 !!_COMPLEX            help_exp(j) = dcmplx(help_tmpcos(j),help_tmpsin(j))
 !!_COMPLEX         end do
 #else
-!_REAL            call vcos(help_cos,help1,j_end(ihelp))
-!_COMPLEX         call vcosisin(help_exp,help1,j_end(ihelp))
+                     help1 = TWO*PI*(ARGX)
+!_REAL               help_cos = dcos(help1)
+!_COMPLEX            help_exp = dcmplx(dcos(help1),dsin(help1))
 #endif
+#if defined (INTEL_VML)
                   do j = 1,j_end(ihelp)
 !_REAL               PHASE(j) = PHASE(j) + help_cos(j)
 !_COMPLEX            PHASE(j) = PHASE(j) + help_exp(j)
+#else
+!_REAL               PHASE(j) = PHASE(j) + help_cos
+!_COMPLEX            PHASE(j) = PHASE(j) + help_exp
+#endif
                   enddo
                enddo 
 
@@ -420,16 +464,25 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
 
                   CALL START_TIMER(time_step_function)
 
+                  v3=THREE*V(JNEQ)
+
+#if !defined (INTEL_VML)
+!$omp simd private(help1,tmp_y,help_sin,help_cos,besr)
+#endif
                   do j=1,j_end_1(ihelp)
+#if defined (INTEL_VML)
                      help1(j) = slen(j,ihelp)*r(jneq)
                      tmp_y(j) = help1(j)*help1(j)*help1(j)
                   end do
 !MASS-Lib http://www.rs6000.ibm.com/resource/technology/MASS
 !MASS-Lib help_sin(j)=sin(help1(j)), help_cos(j)-cos(help1(j))
-#if defined (INTEL_VML)
                   call vdsincos(j_end_1(ihelp),help1,help_sin,help_cos)
 #else
-                  call vsincos(help_sin,help_cos,help1,j_end_1(ihelp))
+                     help1 = slen(j,ihelp)*r(jneq)
+                     tmp_y = help1*help1*help1
+                     help_sin = dsin(help1) 
+                     help_cos = dcos(help1)
+                     tmp_y = 1.0D0 / tmp_y
 #endif
 !MASS-Lib http://www.rs6000.ibm.com/resource/technology/MASS
 !MASS-Lib: tmp_y(j)=1/tmp_y(j), j=1,i-1
@@ -437,13 +490,14 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
 #if defined (INTEL_VML)
                   call vdinv(j_end_1(ihelp), tmp_y, tmp_ytmp)
                   call dcopy(j_end_1(ihelp), tmp_ytmp, 1, tmp_y, 1)
-#else
-                  call vrec(tmp_y,tmp_y,j_end_1(ihelp))
-#endif
-                  v3=THREE*V(JNEQ)
+
                   DO J = 1, j_end_1(ihelp)
                      BESR = (help1(j)*help_COS(j)-help_SIN(j))*tmp_y(j)
                      spanelus(j,ihelp)=spanelus(j,ihelp)+v3*PHASE(j)*BESR
+#else
+                     BESR = (help1*help_COS-help_SIN)*tmp_y
+                     spanelus(j,ihelp)=spanelus(j,ihelp)+v3*PHASE(j)*BESR
+#endif
                   enddo
 #ifdef Parallel
                   ipr=mod((i_g-1)/blocksize,nprow)
@@ -513,6 +567,7 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
                      ATMP = AL_c(ihelp,L)
                      atmp1 = ATMP*U(L,JNEQ)*DU(L,JNEQ)*R(JNEQ)**2
                      DLL1 = DBLE(L+L+1)
+!$omp simd private(tmp3,stmp,htmp)
                      DO J = 1, j_end(ihelp)
                         TMP3 = DLL1*P(J,L,ihelp)
                         STMP = ATMP*AL_r(J,L)
@@ -534,6 +589,7 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
                           AL_c(ihelp,L)*U(L,JNEQ)*DUp(L,JNEQ))
                      DLL1 = DBLE(L+L+1)
 
+!$omp simd private(tmp3,stmp,htmp)
                      DO J = 1, j_end(ihelp)
                         TMP3 = DLL1*P(J,L,ihelp)
                         STMP = ATMP*AL_r(J,L)
@@ -552,6 +608,7 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
                   
                enddo
 
+!$omp simd
                DO J = 1, j_end(ihelp)
                   TMP1(J) = TMP1(J)*VIFPR4
                   TMP2(J) = TMP2(J)*VIFPR4
@@ -595,28 +652,40 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
          DTIMDISTw = DTIMDISTw + READ_wall_TIME(time_distiouter)
 
         END DO iouter_loop
-
-      CALL STOP_TIMER(time_loop260)
+!$omp end do nowait
 
       call end_albl('HAM')
 
       deallocate( PHASE )
       deallocate( spanelus )
-!_COMPLEX      deallocate( help_exp )
-      deallocate( DEN, DFJ, FJ, FK, rk_tmp )
-      deallocate( PPLX, PPLY )
+      deallocate( j_end, j_end_1)
       deallocate( SLEN, TMP1,TMP2)
-      deallocate( X, XL, x2 )
+      deallocate( X, x2 )
+      deallocate( kzz_ihelp )
+      deallocate( PPLX, PPLY )
       deallocate( P )
-      deallocate( help1, help_cos, help_sin )
+#if defined (_OPENMP)
+      deallocate( HPANEL, SPANEL )
+#endif
+
 #if defined (INTEL_VML)
+      deallocate( help1, help_cos, help_sin )
 !!_COMPLEX      deallocate( help_tmpcos, help_tmpsin )
+!_COMPLEX      deallocate( help_exp )
       deallocate( tmp_ytmp )
-#endif
       deallocate( tmp_y )
-      deallocate( kzz_TMP,kzz_ihelp )
-      deallocate( j_end, j_end_1)
-      deallocate( xk_tmp_r,yk_tmp_r,zk_tmp_r)
+#endif
+
+!$omp single
+
+      CALL STOP_TIMER(time_loop260)
+
+
+!this is needed ATM since the first alloc was in a parallel section so we can't just reuse it
+#if defined (_OPENMP)
+      allocate( HPANEL( HSdim_r, BLOCKSIZE) )
+      allocate( SPANEL( HSdim_r, BLOCKSIZE) )
+#endif
 
       WRITE (6,'(a,2f12.1)') 'Time for al,bl    (hamilt, cpu/wall) : ', time_albl_tot, time_albl_totw
       WRITE (6,'(a,2f12.1)') 'Time for legendre (hamilt, cpu/wall) : ', DTIMLG, DTIMLGw
@@ -634,6 +703,8 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
 
       allocate( PHSC(HSrows) )
       allocate( YL(0:(LOMAX+1)*(LOMAX+1)-1,HSrows,multmax) )
+      nloat=nloat_new
+      allocate(C3R(nloat),C13R(nloat),C3RUP(nloat),C3IUP(nloat),C13RUP(nloat),C13IUP(nloat))
       write(6,'(a20,f12.1,a,a,3i6)') 'allocate YL  ',dble((LOMAX+1)*(LOMAX+1)*HSrows*multmax*16)/(1024.0*1024.0),' MB ',&
            '         dimensions',(LOMAX+1)*(LOMAX+1)-1,HSrows,multmax
       write(6,'(a20,f12.1,a,a,i6)') 'allocate phsc',dble(HSrows*16)/(1024.0*1024.0),' MB ',&
@@ -798,6 +869,10 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
 !
 !        determine phase factor
 !
+#if defined(_OPENMP) && !defined (Parallel)
+! cycle is not allowed in the openmp simd loop
+!$omp simd private(argx,argy,argz)
+#endif
                               DO J_g = 1, NV + NNLO
 #ifdef Parallel
                                     ipr=mod((j_g-1)/blocksize,nprow)
@@ -1085,7 +1160,14 @@ ins=0    !fixed PB 20.5.08. because alternative (rm case.vns) does not work anym
       ENDIF
 
       deallocate( hpanel,spanel)
-      deallocate(ROTV2,ROTV1,VEC)
+
+!$omp end single
+!$omp end parallel
+
+      deallocate( rk_tmp )
+      deallocate( XL )
+      deallocate( kzz_TMP )
+      deallocate( xk_tmp_r,yk_tmp_r,zk_tmp_r)
 
       CALL BARRIER
 
diff --git a/modules.F b/modules.F
index c6e9a3f..28d678f 100644
--- a/modules.F
+++ b/modules.F
@@ -279,6 +279,7 @@ BLOCKSIZE = 64
       double precision :: old_cpu_time(64)
       double precision :: wall_time(64)
       double precision :: old_wall_time(64)
+!$omp threadprivate(cpu_time,old_cpu_time,wall_time,old_wall_time)
       integer, parameter :: time_total=1,  &
                    time_hamilt=10, time_hns=11, time_diag=12, time_horb=13,  &
                  time_coor=15,  &
@@ -706,6 +707,7 @@ BLOCKSIZE = 64
 
         DOUBLE PRECISION, allocatable ::   AL(:,:,:)
         DOUBLE PRECISION, allocatable ::   BL(:,:,:)
+!$omp threadprivate(AL_r,BL_r,AL_c,BL_c,AL,BL)
 
       contains
 
@@ -1532,3 +1534,25 @@ BLOCKSIZE = 64
         real*8    korig(3)
 
       end module nmr
+
+      module stdmath
+
+#if defined (HAVE_LIBMVEC) && !defined(INTEL_VML)
+      interface
+        pure function dcos(x) bind(C, NAME="cos")
+          use,intrinsic :: iso_c_binding
+          real(c_double), value, intent(in) :: x
+          real(c_double) :: dcos
+!$omp declare simd (dcos)
+        end function
+
+        pure function dsin(x) bind(C, NAME="sin")
+          use,intrinsic :: iso_c_binding
+          real(c_double), value, intent(in) :: x
+          real(c_double) :: dsin
+!$omp declare simd (dsin)
+        end function
+      end interface
+#endif
+
+      end module stdmath
diff --git a/vectf.f b/vectf.f
deleted file mode 100644
index 28c5009..0000000
--- a/vectf.f
+++ /dev/null
@@ -1,39 +0,0 @@
-!MASS-Lib http://www.rs6000.ibm.com/resource/technology/MASS
-      subroutine vrec(y,x,n)
-      implicit none
-      real*8 x(*),y(*)
-      integer n,j
-      do 10 j=1,n
-      y(j)=1.d0/x(j)
-   10 continue
-      return
-      end
-      subroutine vsincos(x,y,z,n)
-      implicit none
-      real*8 x(*),y(*),z(*)
-      integer n, j
-      do 10 j=1,n
-      x(j)=sin(z(j))
-      y(j)=cos(z(j))
-   10 continue
-      return
-      end
-      subroutine vcos(y,x,n)
-      implicit none
-      real*8 x(*),y(*)
-      integer n,j
-      do 10 j=1,n
-      y(j)=cos(x(j))
-   10 continue
-      return
-      end
-      subroutine vcosisin(y,x,n)
-      implicit none
-      complex*16 y(*)
-      real*8 x(*)
-      integer n,j
-      do 10 j=1,n
-      y(j)=dcmplx(cos(x(j)),sin(x(j)))
-   10 continue
-      return
-      end

_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html

Re: [Wien] System configuration

Reply via email to