Thank you :) A few notes on what you wrote:

1. I always try to keep t*c = number of cores. However, for a 64-core KNL with hyperthreading switched on (cpuinfo shows 256 cores), should t*c be 64 or 256? In other words, is t=64 and c=4 correct?

2. I noticed that for the same input data I may get different timings in two cases:

a) Different numbers of KSP iterations are observed (why do they differ?). Please see the screenshot Julia_N_10_4_vs_64.JPG for the following configs (this may be related to the 64*4 issue; also, which one looks correct at first glance?):

Matrix size=1000x1000

1/ slurm-23716.out, 511 steps, ~ 28 secs
#SBATCH --nodes=1
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=4

2/ slurm-23718.out, 94 steps, ~ 4 secs
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

b) Equal numbers of KSP iterations are observed, but the timings differ (might this be due to false sharing or oversubscription?). Please see the screenshot Julia_N_10_64_vs_64.JPG.

Best,
Damian

In the message dated 5 July 2017 (10:26:46), it was written:

> The man page of slurm/sbatch is cumbersome.
> But you may think of:
> 1. tasks "as MPI processes"
> 2. cpus "as threads"
>
> You should always set resources as precisely as possible,
> that is (never use --ntasks alone, but prefer to):
> 1. use --nodes=n.
> 2. use --ntasks-per-node=t.
> 3. use --cpus-per-task=c.
> 4. for a start, make sure that t*c = number of cores you have per node.
> 5. use --exclusive, otherwise you may get VERY different timings if you run
> the same job twice.
> 6. make sure MPI is configured correctly (run the same single-threaded
> application twice [or more]: do you get the same timing?)
> 7. if using OpenMP or multithreaded applications, make sure you have
> set affinity properly (GOMP_CPU_AFFINITY with GNU, KMP_AFFINITY with Intel).
> 8. make sure you have enough memory (--mem), otherwise performance may be
> degraded (swap).
>
> Rule of thumb 4 may NOT be respected, but if so, you need to be
> aware WHY you want to do that (for KNL, it may [or may not] make sense
> [depending on cache modes]).
>
> Remember that any multi-threaded (OpenMP or not) application may be a
> victim of false sharing
> (https://en.wikipedia.org/wiki/False_sharing): in this case, profiling
> (using cache metrics) may help to understand if this is the problem,
> and to track it down if so (you may use perf record for that).
>
> Understanding HW is not an easy thing: you really need to go step
> by step, otherwise you have no chance of understanding anything in the end.
>
> Hope this may help!
>
> Franck
>
> Note: activating/deactivating hyper-threading (if available,
> generally in the BIOS when possible) may also change performance.
>
> ----- Original message -----
>> From: "Barry Smith" <[email protected]>
>> To: "Damian Kaliszan" <[email protected]>
>> Cc: "PETSc" <[email protected]>
>> Sent: Tuesday, 4 July 2017 19:04:36
>> Subject: Re: [petsc-users] Is OpenMP still available for PETSc?
>>
>> You may need to ask a slurm expert. I have no idea what cpus-per-task
>> means.
>>
>> > On Jul 4, 2017, at 4:16 AM, Damian Kaliszan <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Yes, this is exactly what I meant.
>> > Please find attached the output for 2 input datasets, with 2 different
>> > slurm configs each:
>> >
>> > A/ Matrix size=8000000x8000000
>> >
>> > 1/ slurm-14432809.out, 930 ksp steps, ~90 secs
>> >
>> > #SBATCH --nodes=2
>> > #SBATCH --ntasks=32
>> > #SBATCH --ntasks-per-node=16
>> > #SBATCH --cpus-per-task=4
>> >
>> > 2/ slurm-14432810.out, 100,000 ksp steps, ~9700 secs
>> >
>> > #SBATCH --nodes=2
>> > #SBATCH --ntasks=32
>> > #SBATCH --ntasks-per-node=16
>> > #SBATCH --cpus-per-task=2
>> >
>> > B/ Matrix size=1000x1000
>> >
>> > 1/ slurm-23716.out, 511 ksp steps, ~ 28 secs
>> >
>> > #SBATCH --nodes=1
>> > #SBATCH --ntasks=64
>> > #SBATCH --ntasks-per-node=64
>> > #SBATCH --cpus-per-task=4
>> >
>> > 2/ slurm-23718.out, 94 ksp steps, ~ 4 secs
>> >
>> > #SBATCH --nodes=1
>> > #SBATCH --ntasks=4
>> > #SBATCH --ntasks-per-node=4
>> > #SBATCH --cpus-per-task=4
>> >
>> > I would really appreciate any help...:)
>> >
>> > Best,
>> > Damian
>> >
>> > In the message dated 3 July 2017 (16:29:15), it was written:
>> >
>> > On Mon, Jul 3, 2017 at 9:23 AM, Damian Kaliszan <[email protected]>
>> > wrote:
>> > Hi,
>> >
>> > >> 1) You can call Bcast on PETSC_COMM_WORLD
>> >
>> > To be honest, I can't find a Bcast method in petsc4py.PETSc.Comm (I'm
>> > using petsc4py)
>> >
>> > >> 2) If you are using WORLD, the number of iterates will be the same on
>> > >> each process since iteration is collective.
>> >
>> > Yes, this is how it should be. But what I noticed is that for
>> > different --cpus-per-task values in the slurm script I get a different
>> > number of solver iterations, which in turn affects the timings. The
>> > disparity is huge. For example, for some configurations where
>> > --cpus-per-task=1 I get 900 iterations, and for --cpus-per-task=2
>> > I get 100,000, which is the number set as the maximum iteration
>> > count when setting the solver tolerances.
>> >
>> > I am trying to understand what you are saying. You mean that you make 2
>> > different runs and get a different
>> > number of iterates with a KSP? In order to answer questions about
>> > convergence, we need to see the output
>> > of
>> >
>> > -ksp_view -ksp_monitor_true_residual -ksp_converged_reason
>> >
>> > for all cases.
>> >
>> > Thanks,
>> >
>> > Matt
>> >
>> > Best,
>> > Damian
>> >
>> > --
>> > What most experimenters take for granted before they begin their
>> > experiments is infinitely more interesting than any results to which their
>> > experiments lead.
>> > -- Norbert Wiener
>> >
>> > http://www.caam.rice.edu/~mk51/
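
For reference, the t*c rule of thumb discussed in this thread can be sketched in a few lines of Python. This is only an illustration: `classify_layout` and its parameter names are hypothetical and not part of slurm, PETSc, or petsc4py, and the core and thread counts must be supplied by hand. It only reports what t*c matches; it does not decide which layout is right for KNL, and whether slurm counts a "cpu" as a core or as a hardware thread depends on how the cluster is configured.

```python
def classify_layout(tasks_per_node, cpus_per_task, physical_cores, threads_per_core=1):
    """Compare t*c with a node's physical cores and hardware threads.

    tasks_per_node   -- value passed to --ntasks-per-node (t)
    cpus_per_task    -- value passed to --cpus-per-task (c)
    physical_cores   -- physical cores per node (assumed known, e.g. 64 on the KNL above)
    threads_per_core -- hardware threads per core (4 when cpuinfo shows 256 on 64 cores)
    """
    values = (tasks_per_node, cpus_per_task, physical_cores, threads_per_core)
    if min(values) < 1:
        raise ValueError("all arguments must be positive integers")

    requested = tasks_per_node * cpus_per_task
    hardware_threads = physical_cores * threads_per_core

    if requested == physical_cores:
        return "matches physical cores"
    if requested == hardware_threads:
        return "matches hardware threads"
    if requested > hardware_threads:
        return "oversubscribed"
    if requested < physical_cores:
        return "undersubscribed"
    return "between physical cores and hardware threads"


if __name__ == "__main__":
    # The two 1000x1000 configurations from the thread, on a 64-core KNL
    # with 4 hardware threads per core.
    for t, c in [(64, 4), (4, 4)]:
        print(f"t={t}, c={c}: {classify_layout(t, c, 64, 4)}")
```

With the numbers quoted above, t=64 and c=4 requests 256 cpus, which equals the hardware-thread count rather than the physical-core count, while t=4 and c=4 requests only 16.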
