For a given use case, you may want to try all possible t and c such that t*c = n, and stick with the best one.
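A minimal sketch of such a sweep, assuming n is the number of cores per node (64 below is only an example value). The helper names `task_thread_splits` and `sbatch_header` are made up for illustration and are not part of slurm or PETSc:

```python
# Hypothetical helpers (not from slurm or PETSc): enumerate every way to
# split n cores into t tasks per node and c cpus per task, with t * c == n.

def task_thread_splits(n):
    """Return all (t, c) pairs of positive integers with t * c == n."""
    return [(t, n // t) for t in range(1, n + 1) if n % t == 0]


def sbatch_header(t, c, nodes=1):
    """Render the #SBATCH lines for t tasks per node and c cpus per task."""
    return "\n".join([
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={t}",
        f"#SBATCH --cpus-per-task={c}",
    ])


if __name__ == "__main__":
    n = 64  # assumed core count per node; adjust to your machine
    for t, c in task_thread_splits(n):
        print(f"# candidate: t={t}, c={c}")
        print(sbatch_header(t, c))
```

Each printed header can be pasted into a job script; timing the same case once per candidate is then enough to pick the best (t, c).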
Now, if you modify resources (t/c) and get different timings/iterations, this seems logical to me: blocks, overlap, ... (and finally convergence) will differ, so the comparison no longer really makes sense because you are doing something different (unless you fix t and let c vary; even then, you may not get what you expect. Anyway, it seems that is not what you do).

Franck

----- Original Message -----
> From: "Damian Kaliszan" <[email protected]>
> To: "Franck Houssen" <[email protected]>, "Barry Smith" <[email protected]>
> Cc: [email protected]
> Sent: Wednesday, 5 July 2017 10:50:39
> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>
> Thank you:)
>
> A few notes on what you wrote:
> 1. I always try to keep t*c=number of cores. However, for a 64-core KNL
> with hyperthreading switched on (cpuinfo shows 256 cores), should t*c
> be 64 or 256 (in other words: is t=64 and c=4 correct?)?
> 2. I noticed that for the same input data I may get different
> timings in 2 cases:
> a) a different number of ksp iterations is observed (why do they differ?)
> -> please see screenshot Julia_N_10_4_vs_64.JPG for the following
> config (this may be related to the 64*4 issue; which one is correct
> at first glance?):
>
> Matrix size=1000x1000
>
> 1/ slurm-23716.out, 511 steps, ~ 28 secs
> #SBATCH --nodes=1
> #SBATCH --ntasks=64
> #SBATCH --ntasks-per-node=64
> #SBATCH --cpus-per-task=4
>
> 2/ slurm-23718.out, 94 steps, ~ 4 secs
>
> #SBATCH --nodes=1
> #SBATCH --ntasks=4
> #SBATCH --ntasks-per-node=4
> #SBATCH --cpus-per-task=4
>
> b) an equal number of ksp iterations is observed, but with different timings
> (this might be due to false sharing or oversubscription?);
> please see the Julia_N_10_64_vs_64.JPG screenshot
>
> Best,
> Damian
>
> In a message dated 5 July 2017 (10:26:46), the following was written:
>
> > The man page of slurm/sbatch is cumbersome.
> >
> > But you may think of:
> > 1. tasks "as MPI processes"
> > 2. cpus "as threads"
> >
> > You should always set resources as precisely as possible,
> > that is (never use --ntasks, but prefer to):
> > 1. use --nodes=n.
> > 2. use --ntasks-per-node=t.
> > 3. use --cpus-per-task=c.
> > 4. for a start, make sure that t*c = number of cores you have per node.
> > 5. use --exclusive; otherwise you may get VERY different timings if you
> > run the same job twice.
> > 6. make sure MPI is configured correctly (run the same single-threaded
> > application twice [or more]: do you get the same timing?)
> > 7. if using OpenMP or multithreaded applications, make sure you have
> > set affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with
> > Intel).
> > 8. make sure you have enough memory (--mem); otherwise performance may be
> > degraded (swap).
> >
> > Rule of thumb 4 may NOT be respected, but if so, you need to be
> > aware of WHY you want to do that (for KNL, it may [or may not] make sense
> > [depending on cache modes]).
> >
> > Remember that any multi-threaded (OpenMP or not) application may fall
> > victim to false sharing
> > (https://en.wikipedia.org/wiki/False_sharing): in this case, profiling
> > (using cache metrics) may help to understand whether this is the problem,
> > and to track it down if so (you may use perf-record for that).
> >
> > Understanding HW is not an easy thing: you really need to go step
> > by step; otherwise you have no chance of understanding anything in the end.
> >
> > Hope this may help !...
> >
> > Franck
> >
> > Note: activating/deactivating hyper-threading (if available,
> > generally in the BIOS when possible) may also change performance.
> >
> > ----- Original Message -----
> >> From: "Barry Smith" <[email protected]>
> >> To: "Damian Kaliszan" <[email protected]>
> >> Cc: "PETSc" <[email protected]>
> >> Sent: Tuesday, 4 July 2017 19:04:36
> >> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
> >>
> >> You may need to ask a slurm expert.
> >> I have no idea what cpus-per-task means
> >>
> >> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Yes, this is exactly what I meant.
> >> > Please find attached the output for 2 input datasets, each with 2
> >> > different slurm configs:
> >> >
> >> > A/ Matrix size=8000000x8000000
> >> >
> >> > 1/ slurm-14432809.out, 930 ksp steps, ~90 secs
> >> >
> >> > #SBATCH --nodes=2
> >> > #SBATCH --ntasks=32
> >> > #SBATCH --ntasks-per-node=16
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> > 2/ slurm-14432810.out, 100,000 ksp steps, ~9700 secs
> >> >
> >> > #SBATCH --nodes=2
> >> > #SBATCH --ntasks=32
> >> > #SBATCH --ntasks-per-node=16
> >> > #SBATCH --cpus-per-task=2
> >> >
> >> > B/ Matrix size=1000x1000
> >> >
> >> > 1/ slurm-23716.out, 511 ksp steps, ~ 28 secs
> >> > #SBATCH --nodes=1
> >> > #SBATCH --ntasks=64
> >> > #SBATCH --ntasks-per-node=64
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> > 2/ slurm-23718.out, 94 ksp steps, ~ 4 secs
> >> >
> >> > #SBATCH --nodes=1
> >> > #SBATCH --ntasks=4
> >> > #SBATCH --ntasks-per-node=4
> >> > #SBATCH --cpus-per-task=4
> >> >
> >> > I would really appreciate any help...:)
> >> >
> >> > Best,
> >> > Damian
> >> >
> >> > In a message dated 3 July 2017 (16:29:15), the following was written:
> >> >
> >> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > >> 1) You can call Bcast on PETSC_COMM_WORLD
> >> >
> >> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm (I'm
> >> > using petsc4py)
> >> >
> >> > >> 2) If you are using WORLD, the number of iterates will be the same on
> >> > >> each process since iteration is collective.
> >> >
> >> > Yes, this is how it should be. But what I noticed is that for
> >> > different --cpus-per-task numbers in the slurm script I get a different
> >> > number of solver iterations, which is in turn related to timings.
> >> > The disparity is huge. For example, for some configurations where
> >> > --cpus-per-task=1 I get 900 iterations, and for --cpus-per-task=2 I
> >> > get 100,000, which is the maximum iteration number set when setting
> >> > the solver tolerances.
> >> >
> >> > I am trying to understand what you are saying. You mean that you make 2
> >> > different runs and get a different number of iterates with a KSP?
> >> > In order to answer questions about convergence, we need to see the
> >> > output of
> >> >
> >> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
> >> >
> >> > for all cases.
> >> >
> >> > Thanks,
> >> >
> >> > Matt
> >> >
> >> > Best,
> >> > Damian
> >> >
> >> > --
> >> > What most experimenters take for granted before they begin their
> >> > experiments is infinitely more interesting than any results to which
> >> > their experiments lead.
> >> > -- Norbert Wiener
> >> >
> >> > http://www.caam.rice.edu/~mk51/
> >> >
> >> > -------------------------------------------------------
> >> > Damian Kaliszan
> >> >
> >> > Poznan Supercomputing and Networking Center
> >> > HPC and Data Centres Technologies
> >> > ul. Jana Pawła II 10
> >> > 61-139 Poznan
> >> > POLAND
> >> >
> >> > phone (+48 61) 858 5109
> >> > e-mail [email protected]
> >> > www - http://www.man.poznan.pl/
> >> > -------------------------------------------------------
> >> > <slum_output.zip>
