Dear Peter and Gavin,

Thank you for your help. I had of course gone over the UG, but your explanations 
cleared things up. I will eventually be doing supercell calculations on TiH and 
Ti surfaces, so I will look into the MPI errors in detail then. PBS works fine 
now, with the #PBS -V directive too. Many thanks again.
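
For the archive, the k-point parallel job script from my original post can be
sketched for a Bourne-type shell as below. This is only a sketch under
assumptions: the "line 28: syntax error" message has the format of a bash
error, which suggests the job was not being interpreted by tcsh, so the tcsh
while/end loop is replaced by a shell function. The function name
build_machines and the PBS_NODEFILE guard are my own illustrative additions,
not part of WIEN2k or Torque, and the cut -c1-6 truncation of host names is
left out.

```shell
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:ppn=20
#PBS -l walltime=1:00:00
#PBS -o wien2k_output
#PBS -j oe
#PBS -N wien2k_test
#PBS -V

# Hypothetical helper: write a k-point parallel .machines file with one
# "1:host" line for every non-empty line of the node file.
build_machines() {
    nodefile=$1
    outfile=$2
    echo '#' > "$outfile"
    while IFS= read -r host; do
        if [ -n "$host" ]; then
            echo "1:$host" >> "$outfile"
        fi
    done < "$nodefile"
    echo 'granularity:1' >> "$outfile"
    echo 'extrafine:1' >> "$outfile"
}

# Only run the calculation inside a PBS job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    build_machines "$PBS_NODEFILE" .machines
    run_lapw -p -i 40 -ec .0001 -I
fi
```

The #PBS -V directive exports the environment of the submitting shell to the
job, so run_lapw should be found on the PATH just as it is on the command line.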

Yoji


> On Jun 29, 2017, at 14:49, Yoji Kobayashi <yo...@scl.kyoto-u.ac.jp> wrote:
> 
> Dear Users,
> 
> I have some questions/problems regarding parallelization and PBS. 
> I’m not sure if I’m really running parallel vs. serial, and my PBS script 
> isn’t working.
> 
> ===
> My system info:
> Intel Xeon CPU E5-2630 v2 @2.6 GHz, 24 CPUs
> Memory: 32GB
> Running Wien2k_13, on Ubuntu 14.04.03
> File system: ext4
> (Is this considered a single node with 24 processors?)
> ===
> My first question is, am I really running a parallel calculation in a 
> meaningful way?
> 
> What I try:
> In w2web, a serial calculation (SCF only) for the TiC example (500 k-points)
> takes about 25 sec to converge.
> I do the same calculation (starting with a new case) but with parallelization
> enabled in w2web, using a slightly different .machines file for each case:
> 
> Case 1:
> 1:localhost
> 
> Case 2 (i.e. 20 lines of the following):
> 1:localhost
> 1:localhost
> …
> 1:localhost
> 1:localhost
> 
> Case 3:
> 1:localhost:20
> 
> (no lines referring to granularity, etc. for now)
> 
> What I get:
> Case 1 computes in about 54 sec;
> Case 2 computes in 1min23 sec.;
> Case 3 gives an error in running lapw2, see the dayfile below:
> -----
> Calculating YK-016-TiC in /home/milkbar/Yoji/YK-016-TiC
> on milkbar-computer with PID 18077
> using WIEN2k_13.1 (Release 17/6/2013) in /home/milkbar/WIEN2k_13
> 
> 
>     start     (2017年  6月 29日 木曜日 14:23:39 JST) with lapw0 (40/99 to go)
> 
>     cycle 1   (2017年  6月 29日 木曜日 14:23:39 JST)    (40/99 to go)
> 
> >   lapw0 -p  (14:23:39) starting parallel lapw0 at 2017年  6月 29日 木曜日 14:23:39 JST
> -------- .machine0 : processors
> running lapw0 in single mode
> 1.7u 0.0s 0:01.84 98.3% 0+0k 16+440io 0pf+0w
> >   lapw1  -p         (14:23:41) starting parallel lapw1 at 2017年  6月 29日 木曜日 14:23:41 JST
> ->  starting parallel LAPW1 jobs at 2017年  6月 29日 木曜日 14:23:41 JST
> running LAPW1 in parallel mode (using .machines)
> 1 number_of_parallel_jobs
>      localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost localhost(20) 20 total processes failed to start
> 0.0u 0.0s 0:00.20 10.0% 0+0k 8080+8io 23pf+0w
>    Summary of lapw1para:
>    localhost   k=0     user=0  wallclock=0
> 0.0u 0.0s 0:02.10 0.9% 0+0k 8208+216io 24pf+0w
> >   lapw2 -p          (14:23:43) running LAPW2 in parallel mode
> **  LAPW2 crashed!
> 0.0u 0.0s 0:00.07 28.5% 0+0k 32+104io 0pf+0w
> error: command   /home/milkbar/WIEN2k_13/lapw2para lapw2.def   failed
> 
> >   stop error
> ------
> Is my “serial” calculation in fact already being run over all 24 CPUs, and is
> that why it is faster than Case 2? Or am I doing something wrong? Why does
> Case 3 crash?
> 
> ====
> My second question is about PBS.
> I installed torque PBS, and created a queue:
> 
> # create default queue
>  qmgr -c 'create queue batch'
>  qmgr -c 'set queue batch queue_type = execution'
>  qmgr -c 'set queue batch started = true'
>  qmgr -c 'set queue batch enabled = true'
>  qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
>  qmgr -c 'set queue batch resources_default.nodes = 1'
>  qmgr -c 'set server default_queue = batch'
> 
> and followed other instructions on
> https://jabriffa.wordpress.com/2015/02/11/installing-torquepbs-job-scheduler-on-ubuntu-14-04-lts/
> 
> The PBS system seems to work since I can submit very simple scripts and see 
> them on qstat. My problem is that when I try to submit a serial wien2k job 
> via PBS, it gives me an error (ultimately, of course, I’d like to submit jobs 
> in parallel, but because of the ambiguity above I’ve kept it to serial). 
> Here are the PBS script and the error message:
> 
>  #!/bin/tcsh
>  ##PBS -A your_allocation
>  # specify the allocation. Change it to your allocation
>  #PBS -q batch
>  #PBS -l nodes=1:ppn=20
>  #PBS -l walltime=1:00:00
>  #PBS -o wien2k_output
>  #PBS -j oe
>  #PBS -N wien2k_test
>  cd $PBS_O_WORKDIR
>  echo hello
>  run_lapw -i 40 -ec .0001 -I
> 
> Error message (contents of wien2k_output):
> hello
> /var/spool/torque/mom_priv/jobs/44.milkbar-computer.kage.SC: line 12: run_lapw: command not found
> 
> The job is listed as complete in qstat, and the “hello” is written into the 
> wien2k_output file. Changing the cd $PBS_O_WORKDIR to the path for the 
> current case hasn’t changed anything. I can run run_lapw from the command 
> line fine, though. Also, what do I write for allocation? (I commented it out, 
> as I see other PBS scripts don’t always have this.)
> 
> I’ve also tried the parallel case, with the following PBS script. I set up 
> the .structure file and do the initialization with w2web. I leave the 
> “parallel calculation” option unchecked when setting up the case file in 
> w2web.
> 
>  #!/bin/tcsh
>  ##PBS -A your_allocation
>  #PBS -q batch
>  #PBS -l nodes=1:ppn=20
>  #PBS -l walltime=1:00:00
>  #
>  #PBS -o wien2k_output
>  #PBS -j oe
>  #PBS -N wien2k_test
>  cd $PBS_O_WORKDIR
>  #
>  #cat $PBS_NODEFILE |cut -c1-6 >.machines_currentdd
>  #set aa=`wc .machines_current`
>  #echo '#' > .machines
>  #
>  ##example for k-point parallel lapw1/2
>  set i=1
>       while ($i <= $aa[1] )
>       echo -n '1:' >>.machines
>       head -$i .machines_current |tail -1 >> .machines
>       @ i ++
>  end
> echo 'granularity:1' >>.machines
> echo 'extrafine:1' >>.machines
>  #
>  #define here your Wien2k command
>  run_lapw -p -i 40 -ec .0001 -I
> 
> When I submit this job via qsub, again the job is immediately listed as 
> complete in qstat, and I get the following error message in wien2k_output:
> 
> milkbar@milkbar-computer:~/Yoji/YK-017-TiC$ cat wien2k_output
> /var/spool/torque/mom_priv/jobs/45.milkbar-computer.kage.SC: line 28: syntax error: unexpected end of file
> 
> No .machines file has been created in the case folder.
> How can I successfully submit serial/parallel PBS jobs? Thanks in advance
> for your help.
> 
> Yoji Kobayashi
> 
> ==========================================================
> Yoji Kobayashi, Junior Assoc. Prof.       yo...@scl.kyoto-u.ac.jp
> http://www.scl.kyoto-u.ac.jp/~yojik/index.htm
> 
> Kageyama Group, Dept. of Energy and Hydrocarbon Chemistry
> Graduate School of Engineering, Kyoto University
> Nishikyo-ku, Kyoto 615-8510, Japan
> 
> Tel.: +81-75-383-2509     Fax: +81-75-383-2510
> http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
> ==========================================================
> 

==========================================================
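Yoji Kobayashi  yo...@scl.kyoto-u.ac.jp
http://www.scl.kyoto-u.ac.jp/~yojik/index.htm

Kyoto University Katsura, Nishikyo-ku, Kyoto 615-8510
Dept. of Energy and Hydrocarbon Chemistry, Graduate School of Engineering, Kyoto University
Kageyama Laboratory, Lecturer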

Tel.: 075-383-2509    Fax: 075-383-2510
http://www.ehcc.kyoto-u.ac.jp/eh10/kageyama.html
==========================================================

_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
