Dear Franck,

Thank you for your comment! Did I understand you correctly: in your view, the t<->c combination may influence the convergence -> KSP iterations -> timings (though the result array/vector should be identical)?
Best,
Damian

In a message dated 5 July 2017 (15:09:59), the following was written:

> For a given use case, you may want to try all possible t and c such that t*c=n: stick to the best one.
> Now, if you modify resources (t/c) and you get different timing/iterations, this seems logical to me: blocks, overlap, ... (and finally convergence) will differ, so the comparison no longer really makes sense because you are doing something different (unless you fix t and let c vary: even then, you may not get what you expect - anyway, it seems that's not what you do).
> Franck
>
> ----- Original Message -----
>> From: "Damian Kaliszan" <[email protected]>
>> To: "Franck Houssen" <[email protected]>, "Barry Smith" <[email protected]>
>> Cc: [email protected]
>> Sent: Wednesday, 5 July 2017 10:50:39
>> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>>
>> Thank you :)
>>
>> A few notes on what you wrote:
>> 1. I always try to keep t*c = number of cores; however, for a 64-core KNL with hyperthreading switched on (cpuinfo shows 256 cores), should t*c be 64 or 256 (in other words: is t=64 and c=4 correct)?
>> 2. I noticed that for the same input data I may get different timings in 2 cases:
>> a) a different number of KSP iterations is observed (why do they differ?) -> please see the screenshot Julia_N_10_4_vs_64.JPG for the following configs (this may be related to the 64*4 issue; also, which one is correct at first glance?):
>>
>> Matrix size=1000x1000
>>
>> 1/ slurm-23716.out, 511 steps, ~28 secs
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=64
>> #SBATCH --ntasks-per-node=64
>> #SBATCH --cpus-per-task=4
>>
>> 2/ slurm-23718.out, 94 steps, ~4 secs
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=4
>> #SBATCH --ntasks-per-node=4
>> #SBATCH --cpus-per-task=4
>>
>> b) an equal number of KSP iterations is observed but with different timings (this might be due to false sharing or oversubscription?) - please see the Julia_N_10_64_vs_64.JPG screenshot.
>>
>> Best,
>> Damian
>>
>> In a message dated 5 July 2017 (10:26:46), the following was written:
>>
>> > The man page of slurm/sbatch is cumbersome.
>> > But you may think of:
>> > 1. tasks "as MPI processes"
>> > 2. cpus "as threads"
>> >
>> > You should always set resources in the most precise way possible, that is (never use --ntasks alone, but prefer to):
>> > 1. use --nodes=n.
>> > 2. use --ntasks-per-node=t.
>> > 3. use --cpus-per-task=c.
>> > 4. for a start, make sure that t*c = the number of cores you have per node.
>> > 5. use --exclusive, otherwise you may get VERY different timings if you run the same job twice.
>> > 6. make sure MPI is configured correctly (run the same mono-threaded application twice [or more]: do you get the same timing?).
>> > 7. if using OpenMP or multithreaded applications, make sure you have set affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with Intel).
>> > 8. make sure you have enough memory (--mem), otherwise performance may be degraded (swap).
>> >
>> > Rule of thumb 4 may NOT be respected, but if so, you need to be aware WHY you want to do that (for KNL, it may [or may not] make sense [depending on cache modes]).
>> >
>> > Remember that any multi-threaded (OpenMP or not) application may be a victim of false sharing (https://en.wikipedia.org/wiki/False_sharing): in this case, a profile (using cache metrics) may help you understand whether this is the problem, and track it down if so (you may use perf-record for that).
>> >
>> > Understanding HW is not an easy thing: you really need to go step by step, otherwise you have no chance of understanding anything in the end.
>> >
>> > Hope this may help!...
>> >
>> > Franck
>> >
>> > Note: activating/deactivating hyper-threading (if available - generally in the BIOS when possible) may also change performance.
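A minimal sbatch sketch that pulls Franck's points 1-8 together for one 64-core KNL node; the memory value, the affinity setting, and the solver.py script name are placeholders, not taken from the thread:

  #!/bin/bash
  #SBATCH --nodes=1              # n: number of nodes
  #SBATCH --ntasks-per-node=16   # t: MPI processes per node
  #SBATCH --cpus-per-task=4      # c: threads per process; t*c = 64 = physical cores per KNL node (rule 4;
                                 # lscpu may report 256 logical CPUs when hyper-threading is on)
  #SBATCH --exclusive            # rule 5: do not share the node with other jobs
  #SBATCH --mem=90G              # rule 8: placeholder value, set it to what the node actually has

  # Rule 7: pin the OpenMP threads explicitly (pick the line matching your runtime).
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  export KMP_AFFINITY=compact          # Intel OpenMP
  # export GOMP_CPU_AFFINITY=0-63      # GNU OpenMP alternative

  # solver.py stands in for the petsc4py script used in the thread; rule 6 assumes the
  # MPI/slurm integration has already been verified.
  srun python solver.py

To follow Franck's "try all possible t and c such that t*c=n" suggestion, one could submit the same script with the two directives overridden on the command line, for example:

  for t in 1 2 4 8 16 32 64; do
      c=$((64 / t))
      sbatch --ntasks-per-node=$t --cpus-per-task=$c job.sh   # job.sh: the sketch above (placeholder name)
  done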
>> > ----- Original Message -----
>> >> From: "Barry Smith" <[email protected]>
>> >> To: "Damian Kaliszan" <[email protected]>
>> >> Cc: "PETSc" <[email protected]>
>> >> Sent: Tuesday, 4 July 2017 19:04:36
>> >> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>> >>
>> >> You may need to ask a slurm expert. I have no idea what cpus-per-task means.
>> >>
>> >> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <[email protected]> wrote:
>> >> >
>> >> > Hi,
>> >> >
>> >> > Yes, this is exactly what I meant.
>> >> > Please find attached the output for 2 input datasets, each with 2 different slurm configs:
>> >> >
>> >> > A/ Matrix size=8000000x8000000
>> >> >
>> >> > 1/ slurm-14432809.out, 930 KSP steps, ~90 secs
>> >> > #SBATCH --nodes=2
>> >> > #SBATCH --ntasks=32
>> >> > #SBATCH --ntasks-per-node=16
>> >> > #SBATCH --cpus-per-task=4
>> >> >
>> >> > 2/ slurm-14432810.out, 100,000 KSP steps, ~9700 secs
>> >> > #SBATCH --nodes=2
>> >> > #SBATCH --ntasks=32
>> >> > #SBATCH --ntasks-per-node=16
>> >> > #SBATCH --cpus-per-task=2
>> >> >
>> >> > B/ Matrix size=1000x1000
>> >> >
>> >> > 1/ slurm-23716.out, 511 KSP steps, ~28 secs
>> >> > #SBATCH --nodes=1
>> >> > #SBATCH --ntasks=64
>> >> > #SBATCH --ntasks-per-node=64
>> >> > #SBATCH --cpus-per-task=4
>> >> >
>> >> > 2/ slurm-23718.out, 94 KSP steps, ~4 secs
>> >> > #SBATCH --nodes=1
>> >> > #SBATCH --ntasks=4
>> >> > #SBATCH --ntasks-per-node=4
>> >> > #SBATCH --cpus-per-task=4
>> >> >
>> >> > I would really appreciate any help... :)
>> >> >
>> >> > Best,
>> >> > Damian
>> >> >
>> >> > In a message dated 3 July 2017 (16:29:15), the following was written:
>> >> >
>> >> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > >> 1) You can call Bcast on PETSC_COMM_WORLD
>> >> >
>> >> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm (I'm using petsc4py).
>> >> >
>> >> > >> 2) If you are using WORLD, the number of iterates will be the same on each process since iteration is collective.
>> >> >
>> >> > Yes, this is how it should be. But what I noticed is that for different --cpus-per-task numbers in the slurm script I get a different number of solver iterations, which in turn affects the timings. The disparity is huge. For example, for some configurations with --cpus-per-task=1 I get 900 iterations, and with --cpus-per-task=2 I get 100,000, which is the maximum iteration count set when configuring the solver tolerances.
>> >> >
>> >> > I am trying to understand what you are saying. You mean that you make 2 different runs and get a different number of iterates with a KSP? In order to answer questions about convergence, we need to see the output of
>> >> >
>> >> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>> >> >
>> >> > for all cases.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Matt
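A hedged sketch of what Matt is asking for: re-run each slurm configuration with the same solver and the diagnostic options appended. solver.py is again a placeholder name, and the options are only picked up if the script forwards sys.argv to petsc4py.init() (or otherwise feeds them into the PETSc options database):

  # solver.py: placeholder for the actual petsc4py script used in the runs above
  srun python solver.py \
      -ksp_view \
      -ksp_monitor_true_residual \
      -ksp_converged_reason \
      > ksp_${SLURM_JOB_ID}.log 2>&1

Comparing these logs between the 64x4 and 4x4 runs should show whether the solvers are actually configured identically (same KSP/PC, same block structure) and why the iteration counts diverge.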
>> >> > Best,
>> >> > Damian
>> >> >
>> >> > --
>> >> > What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>> >> > -- Norbert Wiener
>> >> >
>> >> > http://www.caam.rice.edu/~mk51/
>> >> >
>> >> > -------------------------------------------------------
>> >> > Damian Kaliszan
>> >> >
>> >> > Poznan Supercomputing and Networking Center
>> >> > HPC and Data Centres Technologies
>> >> > ul. Jana Pawła II 10
>> >> > 61-139 Poznan
>> >> > POLAND
>> >> >
>> >> > phone (+48 61) 858 5109
>> >> > e-mail [email protected]
>> >> > www - http://www.man.poznan.pl/
>> >> > -------------------------------------------------------
>> >> > <slum_output.zip>
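As a side note to Damian's question 1 (64 vs. 256) and Franck's remark about hyper-threading: a quick way to see how many physical cores versus hardware threads a node exposes, for example from an interactive shell on the KNL node, is:

  lscpu | egrep 'Socket|Core|Thread'
  # "Thread(s) per core: 4" on KNL with hyper-threading on means the 64 physical cores
  # appear as 256 logical CPUs; Franck's rule 4 (t*c = cores per node) most naturally
  # counts the physical cores.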
