I am trying to run GPU-enabled Quantum ESPRESSO (QE 6.8) on Ubuntu 18.04.5 LTS (GNU/Linux 4.15.0-135-generic x86_64).

System configuration:
Processor: Intel Xeon Gold 5120 CPU, 2.20 GHz (2 processors)
RAM: 96 GB
HDD: 6 TB
Graphics card: NVIDIA Quadro P5000 (16 GB)

I can run small jobs successfully (dynamical RAM ~1 GB). However, for larger systems (still below 16 GB), the output stops abruptly during the first iteration (output attached below):

     Program PWSCF v.6.8 starts on  8Oct2021 at 10:33:9

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
         "P. Giannozzi et al., J. Chem. Phys. 152 154105 (2020);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details at
     http://www.quantum-espresso.org/quote

     Parallel version (MPI & OpenMP), running on 784 processor cores
     Number of MPI processes: 28
     Threads/MPI process: 28

     MPI processes distributed on 1 nodes
     R & G space division: proc/nbgrp/npool/nimage = 28
     43440 MiB available memory on the printing compute node when the environment starts

     Reading input from 001.in
     Warning: card &CELL ignored
     Warning: card / ignored

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) = 40000
     Max angular momentum in pseudopotentials (lmaxx) = 4
     file Ti.pbe-spn-rrkjus_psl.1.0.0.upf: wavefunction(s) 3S 3D renormalized

     gamma-point specific algorithms are used
     Found symmetry operation: I + ( -0.0000 -0.5000 0.0000)
     This is a supercell, fractional translations are disabled

     Subspace diagonalization in iterative solution of the eigenvalue problem:
     a serial algorithm will be used

     Parallelization info
     --------------------
     sticks:   dense  smooth     PW     G-vecs:    dense   smooth      PW
     Min         637     232     57                81572    18102    2258
     Max         640     234     60                81588    18118    2266
     Sum       17865    6549   1633              2284245   507201   63345

     Using Slab Decomposition

     bravais-lattice index = 14
     lattice parameter (alat) = 21.0379 a.u.
     unit-cell volume = 9204.2807 (a.u.)^3
     number of atoms/cell = 36
     number of atomic types = 2
     number of electrons = 288.00
     number of Kohn-Sham states= 173
     kinetic-energy cutoff = 55.0000 Ry
     charge density cutoff = 600.0000 Ry
     scf convergence threshold = 1.0E-06
     mixing beta = 0.4000
     number of iterations used = 8 local-TF mixing
     energy convergence thresh.= 1.0E-04
     force convergence thresh. = 1.0E-03
     Exchange-correlation= PBE
                           ( 1 4 3 4 0 0 0)
     nstep = 500

     GPU acceleration is ACTIVE.

     Message from routine print_cuda_info:
     High GPU oversubscription detected. Are you sure this is what you want?

     GPU used by master process:

        Device Number: 0
        Device name: Quadro P5000
        Compute capability : 61
        Ratio of single to double precision performance : 32
        Memory Clock Rate (KHz): 4513000
        Memory Bus Width (bits): 256
        Peak Memory Bandwidth (GB/s): 288.83

     celldm(1)= 21.037943 celldm(2)= 1.000000 celldm(3)= 2.419041
     celldm(4)= -0.766650 celldm(5)= -0.766650 celldm(6)= 0.533303

     crystal axes: (cart. coord. in units of alat)
               a(1) = (  1.000000  0.000000  0.000000 )
               a(2) = (  0.533303  0.845924  0.000000 )
               a(3) = ( -1.854558 -1.023161  1.168553 )

     reciprocal axes: (cart. coord. in units 2 pi/alat)
               b(1) = (  1.000000 -0.630438  1.035056 )
               b(2) = ( -0.000000  1.182139  1.035056 )
               b(3) = (  0.000000  0.000000  0.855759 )

     PseudoPot. # 1 for Ti read from file:
     ../Ti.pbe-spn-rrkjus_psl.1.0.0.upf
     MD5 check sum: e281089c08e14b8efcf92e44a67ada65
     Pseudo is Ultrasoft + core correction, Zval = 12.0
     Generated using "atomic" code by A. Dal Corso v.6.2.2
     Using radial grid of 1177 points, 6 beta functions with:
                l(1) = 0
                l(2) = 0
                l(3) = 1
                l(4) = 1
                l(5) = 2
                l(6) = 2
     Q(r) pseudized with 0 coefficients

     PseudoPot. # 2 for O read from file:
     ../O.pbe-n-rrkjus_psl.1.0.0.upf
     MD5 check sum: 91400c9766925bcf19f520983a725ff0
     Pseudo is Ultrasoft + core correction, Zval = 6.0
     Generated using "atomic" code by A. Dal Corso v.6.3MaX
     Using radial grid of 1095 points, 4 beta functions with:
                l(1) = 0
                l(2) = 0
                l(3) = 1
                l(4) = 1
     Q(r) pseudized with 0 coefficients

     atomic species   valence    mass     pseudopotential
        Ti            12.00    47.86700     Ti( 1.00)
        O              6.00    15.99940     O ( 1.00)

     Starting magnetic structure
     atomic species   magnetization
        Ti           0.200
        O            0.000

     No symmetry found

                                    s                        frac. trans.

      isym = 1     identity

 cryst.   s( 1) = (     1          0          0      )
                  (     0          1          0      )
                  (     0          0          1      )

 cart.    s( 1) = (  1.0000000  0.0000000  0.0000000 )
                  (  0.0000000  1.0000000  0.0000000 )
                  (  0.0000000  0.0000000  1.0000000 )

     point group C_1 (1)
     there are 1 classes
     the character table:

       E
A      1.00

     the symmetry operations in each class and the name of the first element:

     E        1
          identity

   Cartesian axes

     site n.     atom                  positions (alat units)
         1           O   tau(   1) = (  -0.8353365  -0.5987815   0.7050395  )
         2           Ti  tau(   2) = (  -0.6772809  -0.5115821   0.7050395  )
         3           O   tau(   3) = (  -0.5192254  -0.4243827   0.7050395  )
         4           Ti  tau(   4) = (  -0.9272815  -0.5115821   0.5842738  )
         5           O   tau(   5) = (  -0.7692260  -0.4243827   0.5842738  )
         6           O   tau(   6) = (  -0.3186838  -0.1758181   0.5842738  )
         7           O   tau(   7) = (  -0.4520098  -0.3872999   0.4635080  )
         8           Ti  tau(   8) = (  -0.2939543  -0.3001004   0.4635080  )
         9           O   tau(   9) = (  -0.1358987  -0.2129011   0.4635080  )
        10           O   tau(  10) = (  -0.5686844  -0.1758181   0.7050395  )
        11           Ti  tau(  11) = (  -0.4106289  -0.0886188   0.7050395  )
        12           O   tau(  12) = (  -0.2525734  -0.0014194   0.7050395  )
        13           Ti  tau(  13) = (  -0.6606296  -0.0886188   0.5842738  )
        14           O   tau(  14) = (  -0.5025740  -0.0014194   0.5842738  )
        15           O   tau(  15) = (  -0.0520318   0.2471452   0.5842738  )
        16           O   tau(  16) = (  -0.1853578   0.0356635   0.4635080  )
        17           Ti  tau(  17) = (  -0.0273023   0.1228629   0.4635080  )
        18           O   tau(  18) = (   0.1307533   0.2100623   0.4635080  )
        19           O   tau(  19) = (  -0.3353351  -0.5987815   0.7050395  )
        20           Ti  tau(  20) = (  -0.1772797  -0.5115821   0.7050395  )
        21           O   tau(  21) = (  -0.0192241  -0.4243827   0.7050395  )
        22           Ti  tau(  22) = (  -0.4272803  -0.5115821   0.5842738  )
        23           O   tau(  23) = (  -0.2692247  -0.4243827   0.5842738  )
        24           O   tau(  24) = (   0.1813175  -0.1758181   0.5842738  )
        25           O   tau(  25) = (   0.0479915  -0.3872999   0.4635080  )
        26           Ti  tau(  26) = (   0.2060470  -0.3001004   0.4635080  )
        27           O   tau(  27) = (   0.3641026  -0.2129011   0.4635080  )
        28           O   tau(  28) = (  -0.0686832  -0.1758181   0.7050395  )
        29           Ti  tau(  29) = (   0.0893724  -0.0886188   0.7050395  )
        30           O   tau(  30) = (   0.2474280  -0.0014194   0.7050395  )
        31           Ti  tau(  31) = (  -0.1606282  -0.0886188   0.5842738  )
        32           O   tau(  32) = (  -0.0025728  -0.0014194   0.5842738  )
        33           O   tau(  33) = (   0.4479695   0.2471452   0.5842738  )
        34           O   tau(  34) = (   0.3146435   0.0356635   0.4635080  )
        35           Ti  tau(  35) = (   0.4726991   0.1228629   0.4635080  )
        36           O   tau(  36) = (   0.6307546   0.2100623   0.4635080  )

   Crystallographic axes

     site n.     atom                  positions (cryst. coord.)
         1           O   tau(   1) = (   0.2719137   0.0219125   0.6033439  )
         2           Ti  tau(   2) = (   0.3749954   0.1249943   0.6033439  )
         3           O   tau(   3) = (   0.4780771   0.2280761   0.6033439  )
         4           Ti  tau(   4) = (  -0.0000046  -0.0000050   0.4999975  )
         5           O   tau(   5) = (   0.1030772   0.1030768   0.4999975  )
         6           O   tau(   6) = (   0.3969147   0.3969146   0.4999975  )
         7           O   tau(   7) = (   0.2719156   0.0219145   0.3966511  )
         8           Ti  tau(   8) = (   0.3749973   0.1249964   0.3966511  )
         9           O   tau(   9) = (   0.4780790   0.2280781   0.3966511  )
        10           O   tau(  10) = (   0.2719134   0.5219140   0.6033439  )
        11           Ti  tau(  11) = (   0.3749952   0.6249957   0.6033439  )
        12           O   tau(  12) = (   0.4780769   0.7280775   0.6033439  )
        13           Ti  tau(  13) = (  -0.0000048   0.4999964   0.4999975  )
        14           O   tau(  14) = (   0.1030769   0.6030781   0.4999975  )
        15           O   tau(  15) = (   0.3969145   0.8969160   0.4999975  )
        16           O   tau(  16) = (   0.2719153   0.5219160   0.3966511  )
        17           Ti  tau(  17) = (   0.3749970   0.6249978   0.3966511  )
        18           O   tau(  18) = (   0.4780787   0.7280796   0.3966511  )
        19           O   tau(  19) = (   0.7719150   0.0219125   0.6033439  )
        20           Ti  tau(  20) = (   0.8749966   0.1249943   0.6033439  )
        21           O   tau(  21) = (   0.9780784   0.2280761   0.6033439  )
        22           Ti  tau(  22) = (   0.4999967  -0.0000050   0.4999975  )
        23           O   tau(  23) = (   0.6030784   0.1030768   0.4999975  )
        24           O   tau(  24) = (   0.8969160   0.3969146   0.4999975  )
        25           O   tau(  25) = (   0.7719169   0.0219145   0.3966511  )
        26           Ti  tau(  26) = (   0.8749985   0.1249964   0.3966511  )
        27           O   tau(  27) = (   0.9780803   0.2280781   0.3966511  )
        28           O   tau(  28) = (   0.7719147   0.5219140   0.6033439  )
        29           Ti  tau(  29) = (   0.8749965   0.6249957   0.6033439  )
        30           O   tau(  30) = (   0.9780782   0.7280775   0.6033439  )
        31           Ti  tau(  31) = (   0.4999965   0.4999964   0.4999975  )
        32           O   tau(  32) = (   0.6030782   0.6030781   0.4999975  )
        33           O   tau(  33) = (   0.8969158   0.8969160   0.4999975  )
        34           O   tau(  34) = (   0.7719166   0.5219160   0.3966511  )
        35           Ti  tau(  35) = (   0.8749983   0.6249978   0.3966511  )
        36           O   tau(  36) = (   0.9780801   0.7280796   0.3966511  )

     number of k points= 1 Gaussian smearing, width (Ry)= 0.0100
                       cart. coord. in units 2pi/alat
        k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 1.0000000

                       cryst. coord.
        k( 1) = ( 0.0000000 0.0000000 0.0000000), wk = 1.0000000

     Dense grid: 1142123 G-vectors FFT dimensions: ( 180, 180, 400)
     Smooth grid: 253601 G-vectors FFT dimensions: ( 100, 100, 243)

     Dynamical RAM for wfc: 2.99 MB
     Dynamical RAM for wfc (w. buffer): 2.99 MB
     Dynamical RAM for str. fact: 1.24 MB
     Dynamical RAM for local pot: 0.00 MB
     Dynamical RAM for nlocal pot: 7.05 MB
     Dynamical RAM for qrad: 3.93 MB
     Dynamical RAM for rho,v,vnew: 25.98 MB
     Dynamical RAM for rhoin: 8.66 MB
     Dynamical RAM for G-vectors: 2.40 MB
     Dynamical RAM for h,s,v(r/c): 2.74 MB
     Dynamical RAM for <psi|beta>: 0.54 MB
     Dynamical RAM for psi: 5.98 MB
     Dynamical RAM for hpsi: 5.98 MB
     Dynamical RAM for spsi: 5.98 MB
     Dynamical RAM for wfcinit/wfcrot: 8.53 MB
     Dynamical RAM for addusdens: 131.34 MB
     Dynamical RAM for addusforce: 160.16 MB

     Estimated static dynamical RAM per process > 76.37 MB
     Estimated max dynamical RAM per process > 236.53 MB
     Estimated total dynamical RAM > 6.47 GB

     Check: negative core charge= -0.000001
     Generating pointlists ...
     new r_m : 0.0722 (alat units) 1.5191 (a.u.) for type 1
     new r_m : 0.0722 (alat units) 1.5191 (a.u.) for type 2

     Initial potential from superposition of free atoms

     starting charge 287.98222, renormalised to 288.00000
     negative rho (up, down): 9.119E-05 6.477E-05
     Starting wfcs are 216 randomized atomic wfcs

     total cpu time spent up to now is 14.0 secs

     Self-consistent Calculation
[tb_dev] Currently allocated 2.23E+01 Mbytes, locked: 0 / 9
[tb_pin] Currently allocated 0.00E+00 Mbytes, locked: 0 / 0

     iteration # 1 ecut= 55.00 Ry beta= 0.40
     Davidson diagonalization with overlap

---- Real-time Memory Report at c_bands before calling an iterative solver
     980 MiB given to the printing process from OS
     0 MiB allocation reported by mallinfo(arena+hblkhd)
     32000 MiB available memory on the node where the printing process lives
     GPU memory used/free/total (MiB): 11117 / 5152 / 16270
------------------
     ethr = 1.00E-02, avg # of iterations = 1.5

The CRASH file generated says:

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 24
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 14
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 5
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 7
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 15
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 17
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 10
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 9
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 12
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 4
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 13
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     task # 19
     from addusdens_gpu : error # 1
     cannot allocate aux2_d
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Using -ndiag 1 and -ntg 1 with pw.x gave similar output, with the following additional lines:

     negative rho (up, down): 9.119E-05 6.477E-05
     Starting wfcs are 216 randomized atomic wfcs

     total cpu time spent up to now is 11.9 secs

     Self-consistent Calculation
[tb_dev] Currently allocated 3.21E+01 Mbytes, locked: 0 / 9
[tb_pin] Currently allocated 0.00E+00 Mbytes, locked: 0 / 0

     iteration # 1 ecut= 55.00 Ry beta= 0.40
     Davidson diagonalization with overlap

---- Real-time Memory Report at c_bands before calling an iterative solver
     1036 MiB given to the printing process from OS
     0 MiB allocation reported by mallinfo(arena+hblkhd)
     36041 MiB available memory on the node where the printing process lives
     GPU memory used/free/total (MiB): 8915 / 7354 / 16270
------------------
     ethr = 1.00E-02, avg # of iterations = 1.5
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156244752 bytes requested; status = 2(out of memory)
0: ALLOCATE: 156239280 bytes requested; status = 2(out of memory)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58344,1],12]
  Exit code: 127
--------------------------------------------------------------------------

I believe I am not "filling the CPUs with OpenMP threads" or running 1 MPI process on 1 GPU, as suggested in this document. Could someone please offer some suggestions?

Sorry for the long post. I am totally new to this field. Any help would be appreciated.

Thanks in advance

--
Sent by
*ANSON THOMAS*
*M.Sc. Chemistry, IIT Roorkee, India*
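
As a rough sanity check on the memory figures in the first log, here is a minimal Python sketch. The helper names are made up for illustration, and every number is copied from the output above rather than measured separately. It only checks that the reported total estimate equals the per-process maximum times the 28 MPI processes, and shows how much GPU memory each process holds on average when all of them share the single Quadro P5000.

```python
# Sanity check on memory figures copied from the pw.x log above.
# Helper names are illustrative only; they are not part of Quantum ESPRESSO.

def total_estimate_gb(n_ranks: int, max_mb_per_rank: float) -> float:
    """Total estimate in GB if every MPI process reaches its per-process maximum."""
    return n_ranks * max_mb_per_rank / 1024.0


def avg_gpu_mib_per_rank(gpu_used_mib: float, n_ranks: int) -> float:
    """Average GPU memory held per MPI process when all share one device."""
    return gpu_used_mib / n_ranks


n_ranks = 28              # "Number of MPI processes: 28"
max_mb_per_rank = 236.53  # "Estimated max dynamical RAM per process"
gpu_used_mib = 11117      # first run, "GPU memory used/free/total (MiB)"
gpu_total_mib = 16270     # same line of the memory report

print(f"total estimate: {total_estimate_gb(n_ranks, max_mb_per_rank):.2f} GB")
print(f"GPU memory per process: {avg_gpu_mib_per_rank(gpu_used_mib, n_ranks):.0f} MiB")
print(f"fraction of GPU in use before the solver: {gpu_used_mib / gpu_total_mib:.0%}")
```

With these inputs the first figure matches the 6.47 GB that pw.x reports, and the average GPU memory per process (about 397 MiB) is well above the 76.37 MB static estimate per process. Whether that gap comes from per-process GPU overhead is only a guess on my part and is not verified.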