Dear Peter and Gavin,

Thank you for your help. Of course I went over the user's guide, but your explanations cleared things up. I will eventually be doing supercell calculations on TiH and Ti surfaces, so I will look into the MPI errors in detail then. PBS works fine now, with the #PBS -V directive too. Many thanks again.
Yoji

> On Jun 29, 2017, at 14:49, Yoji Kobayashi <yo...@scl.kyoto-u.ac.jp> wrote:
>
> Dear Users,
>
> I have some questions/problems regarding parallelization and PBS.
> I’m not sure whether I’m really running in parallel or in serial, and my
> PBS script isn’t working.
>
> ===
> My system info:
> Intel Xeon CPU E5-2630 v2 @2.6 GHz, 24 CPUS
> Memory: 32GB
> Running Wien2k_13, on Ubuntu 14.04.03
> File system: ext4
> (This is considered a single node with 24 processors?)
> ===
> My first question is, am I really running a parallel calculation in a
> meaningful way?
>
> What I try:
> In w2web, a serial calculation (SCF only) for the TiC example (500 k
> points) takes about 25 sec. to converge.
> I do the same calculation (starting with a new case) but with
> parallelization set in w2web, using a slightly different .machines file
> for each case:
>
> Case 1:
> 1:localhost
>
> Case 2 (i.e. 20 lines of below):
> 1:localhost
> 1:localhost
> …
> 1:localhost
> 1:localhost
>
> Case 3
> 1:localhost:20
>
> (no lines referring to granularity, etc. for now)
>
> What I get:
> Case 1 computes in about 54 sec;
> Case 2 computes in 1 min 23 sec;
> Case 3 gives an error in running lapw2, see the dayfile below:
> -----
> Calculating YK-016-TiC in /home/milkbar/Yoji/YK-016-TiC
> on milkbar-computer with PID 18077
> using WIEN2k_13.1 (Release 17/6/2013) in /home/milkbar/WIEN2k_13
>
>
>     start (2017年 6月 29日 木曜日 14:23:39 JST) with lapw0 (40/99 to go)
>
>     cycle 1 (2017年 6月 29日 木曜日 14:23:39 JST) (40/99 to go)
>
> >   lapw0 -p (14:23:39) starting parallel lapw0 at 2017年 6月 29日 木曜日 14:23:39 JST
> -------- .machine0 : processors
> running lapw0 in single mode
> 1.7u 0.0s 0:01.84 98.3% 0+0k 16+440io 0pf+0w
> >   lapw1 -p (14:23:41) starting parallel lapw1 at 2017年 6月 29日 木曜日 14:23:41 JST
> ->  starting parallel LAPW1 jobs at 2017年 6月 29日 木曜日 14:23:41 JST
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>      localhost localhost localhost localhost localhost localhost localhost
>      localhost localhost localhost localhost localhost localhost localhost
>      localhost localhost localhost localhost localhost localhost(20) 20 total
>      processes failed to start
> 0.0u 0.0s 0:00.20 10.0% 0+0k 8080+8io 23pf+0w
>    Summary of lapw1para:
>    localhost k=0 user=0 wallclock=0
> 0.0u 0.0s 0:02.10 0.9% 0+0k 8208+216io 24pf+0w
> >   lapw2 -p (14:23:43) running LAPW2 in parallel mode
> **  LAPW2 crashed!
> 0.0u 0.0s 0:00.07 28.5% 0+0k 32+104io 0pf+0w
> error: command /home/milkbar/WIEN2k_13/lapw2para lapw2.def failed
>
> >   stop error
> ------
> Is my “serial” calculation actually processed over 24 CPUs already, so this
> is why it is faster than Case 2? Or am I doing something wrong? Why does
> Case 3 crash?
>
> ====
> My second question is about PBS.
> I installed torque PBS, and created a queue:
>
> # create default queue
> qmgr -c 'create queue batch'
> qmgr -c 'set queue batch queue_type = execution'
> qmgr -c 'set queue batch started = true'
> qmgr -c 'set queue batch enabled = true'
> qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
> qmgr -c 'set queue batch resources_default.nodes = 1'
> qmgr -c 'set server default_queue = batch'
>
> and followed other instructions on
> https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/
>
> The PBS system seems to work since I can submit very simple scripts and see
> them in qstat. My problem is that when I try to submit a serial wien2k job
> via PBS, it gives me an error (ultimately of course I’d like to submit them
> as parallel, but because of the ambiguity above I’ve kept it to serial).
> Here's the PBS script and error message:
>
> #!/bin/tcsh
> ##PBS -A your_allocation
> # specify the allocation. Change it to your allocation
> #PBS -q batch
> #PBS -l nodes=1:ppn=20
> #PBS -l walltime=1:00:00
> #PBS -o wien2k_output
> #PBS -j oe
> #PBS -N wien2k_test
> cd $PBS_O_WORKDIR
> echo hello
> run_lapw -i 40 -ec .0001 -I
>
> Error message (contents of wien2k_output):
> hello
> /var/spool/torque/mom_priv/jobs/44.milkbar-computer.kage.SC: line 12:
> run_lapw: command not found
>
> The job is listed as complete in qstat, and the “hello” is written into the
> wien2k_output file. Changing the cd $PBS_O_WORKDIR to the path for the
> current case hasn’t changed anything. I can run run_lapw from the command
> line fine, though. Also, what do I write for allocation? (I commented it
> out, as I see other PBS scripts don’t always have this.)
>
> I’ve also tried the parallel case, with the following PBS script. I set up
> the structure file and do the initialization with w2web. I leave the
> “parallel calculation” option unchecked when setting up the case in w2web.
>
> #!/bin/tcsh
> ##PBS -A your_allocation
> #PBS -q batch
> #PBS -l nodes=1:ppn=20
> #PBS -l walltime=1:00:00
> #
> #PBS -o wien2k_output
> #PBS -j oe
> #PBS -N wien2k_test
> cd $PBS_O_WORKDIR
> #
> #cat $PBS_NODEFILE |cut -c1-6 >.machines_currentdd
> #set aa=`wc .machines_current`
> #echo '#' > .machines
> #
> ##example for k-point parallel lapw1/2
> set i=1
> while ($i <= $aa[1] )
> echo -n '1:' >>.machines
> head -$i .machines_current |tail -1 >> .machines
> @ i ++
> end
> echo 'granularity:1' >>.machines
> echo 'extrafine:1' >>.machines
> #
> #define here your Wien2k command
> run_lapw -p -i 40 -ec .0001 -I
>
> When I submit this job via qsub, again the job is immediately listed as
> complete in qstat, and I get the following error message in wien2k_output:
>
> milkbar@milkbar-computer:~/Yoji/YK-017-TiC$ cat wien2k_output
> /var/spool/torque/mom_priv/jobs/45.milkbar-computer.kage.SC: line 28: syntax
> error: unexpected end of file
>
> No .machines file has been created in the case folder.
>
> How can I successfully submit serial/parallel PBS jobs? Thanks in advance
> for your help.
>
> Yoji Kobayashi
>
> ==========================================================
> Yoji Kobayashi, Junior Assoc. Prof. yo...@scl.kyoto-u.ac.jp
> http://www.scl.kyoto-u.ac.jp/~yojik/index.htm
>
> Kageyama Group, Dept. of Energy and Hydrocarbon Chemistry
> Graduate School of Engineering, Kyoto University
> Nishikyo-ku, Kyoto 615-8510, Japan
>
> Tel.: +81-75-383-2509 Fax: +81-75-383-2510
> http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
> ==========================================================

==========================================================
Yoji Kobayashi yo...@scl.kyoto-u.ac.jp
http://www.scl.kyoto-u.ac.jp/~yojik/index.htm
Kyoto University Katsura, Nishikyo-ku, Kyoto 615-8510
Kyoto University, Graduate School of Engineering
Dept. of Energy and Hydrocarbon Chemistry
Kageyama Group, Junior Assoc. Prof.
Tel.: 075-383-2509 Fax: 075-383-2510
http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
==========================================================
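For reference, the .machines generation that the quoted parallel PBS script attempts can be sketched as below. In the quoted script the `cat $PBS_NODEFILE` and `set aa=...` lines are commented out, so .machines_current is never created and `$aa` is undefined when the loop runs. The wording of both error messages ("line N: ...") also suggests the job was interpreted by a Bourne-type shell rather than tcsh, so this sketch is written for sh/bash. The function name `make_machines` is hypothetical and not part of WIEN2k or Torque; the sketch assumes the node file holds one host name per line, and it leaves out the `cut -c1-6` host-name truncation of the quoted script.

```shell
# Hypothetical helper (not part of WIEN2k or Torque): write a k-point
# parallel .machines file with one "1:<host>" line per node-file entry.
make_machines() {
    nodefile=$1   # assumption: one host name per line, e.g. "$PBS_NODEFILE"
    outfile=$2    # e.g. .machines in the case directory
    {
        echo '#'
        while IFS= read -r host; do
            # skip empty lines; every host entry gets weight 1
            if [ -n "$host" ]; then
                echo "1:$host"
            fi
        done < "$nodefile"
        echo 'granularity:1'
        echo 'extrafine:1'
    } > "$outfile"
}

# Possible use inside a job script (left as comments so the block can be
# run outside a PBS job):
#   cd "$PBS_O_WORKDIR"
#   make_machines "$PBS_NODEFILE" .machines
#   run_lapw -p -i 40 -ec .0001 -I
```

With `#PBS -V` in the job header, the environment of the submitting shell (including the PATH entry that makes run_lapw visible) is exported to the job, which is consistent with the "command not found" error going away.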