Re: [gmx-users] Tests with Threadripper and dual gpu setup

2018-02-12 Thread Harry Mark Greenblatt
BS”D



Did you build with or without hwloc?

I did use hwloc.
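
(For anyone reproducing this build: a minimal sketch, not from this thread, of how hwloc support is typically switched on and verified. GMX_HWLOC is assumed to be the relevant CMake option for this GROMACS version, and the md.log header reports how much topology information was detected.)

# Hedged build sketch -- option names assumed, paths illustrative.
cmake .. -DGMX_GPU=ON -DGMX_HWLOC=ON -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-2018
make -j 16 && make install

# After a run, check how much hardware topology mdrun could see
# (hwloc gives the most detailed report):
grep "Hardware topology" test.npt.log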




—
Gromacs 2018 rc1 (using gcc 4.8.5)
—

Using AVX_256


You should be using AVX2_128 or AVX2_256 on Zen! The former will be fastest
in CPU-only runs, the latter can often be (a bit) faster in GPU-accelerated
runs.


Once I saw that AVX2_128 was faster, I did not think there would be an 
advantage to AVX2_256 on GPU runs.

So, is there any suggestion for overcoming the problem of gcc 5.5 not
recognising what CPU hardware I have (not that 5.5 gave much of an advantage
in Gromacs 2016)?
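
One possible workaround, sketched here under assumptions not stated in the thread: GROMACS selects its SIMD kernels via the GMX_SIMD CMake variable rather than via the compiler's -march=native detection, so the SIMD level can be pinned explicitly even when gcc 5.5 does not recognise the CPU (znver1 support only arrived in gcc 6; core-avx2 is an illustrative fallback).

# Check what gcc's native detection actually resolves to:
gcc -march=native -Q --help=target | grep -- '-march='

# Pin the SIMD level explicitly instead of relying on detection
# (flag values below are illustrative, not from the thread):
cmake .. -DGMX_GPU=ON -DGMX_SIMD=AVX2_128 \
      -DCMAKE_C_FLAGS="-march=core-avx2" -DCMAKE_CXX_FLAGS="-march=core-avx2"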


Now force Dynamic Load Balancing

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
-npme 1 -gputasks 0011 -nb gpu -dlb yes


I would recommend *against* doing that unless you have concrete cases where
this is better than "-dlb auto" -- and if you have such cases, please share
them as it is not expected behavior. (Note: DLB can now detect when turning
it on leads to a performance drop, and it switches itself off automatically
in such cases!)


I did see that in some cases it was turning off DLB for a while, or for the
rest of the run.

In my case, however, I did get better results by forcing it on.
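
For anyone who wants to reproduce the comparison, a hedged benchmarking sketch (mine, not from the thread) for timing -dlb auto against -dlb yes on the same .tpr with mdrun's usual benchmarking flags:

# Short, reset-timed runs keep the comparison fair; file names are illustrative.
for dlb in auto yes; do
    gmx mdrun -s test.npt.tpr -deffnm dlb_$dlb -pme gpu -nb gpu -pin on \
              -ntmpi 4 -npme 1 -gputasks 0011 -dlb $dlb \
              -nsteps 20000 -resethway -noconfout
    grep Performance dlb_$dlb.log
done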

I can send the .tpr file to you, off-list, if you want…



Thanks

Harry





Harry M. Greenblatt
Associate Staff Scientist
Dept of Structural Biology   
harry.greenbl...@weizmann.ac.il
Weizmann Institute of Science    Phone:      972-8-934-6340
234 Herzl St.                    Facsimile:  972-8-934-3361
Rehovot, 7610001
Israel


Re: [gmx-users] Tests with Threadripper and dual gpu setup

2018-02-09 Thread Szilárd Páll
Hi,

Thanks for the report!

Did you build with or without hwloc? There is a known issue with the
automatic pin stride when not using hwloc: it leads to "compact" pinning
(using half of the cores with 2 threads/core) when <= half of the hardware
threads are launched, instead of using all cores with 1 thread/core, which
is the default on Intel.
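
(A hedged workaround sketch, not from this reply: if the automatic stride comes out wrong without hwloc, the pinning can simply be set by hand, as Harry did below; whether stride 1 or 2 lands one thread per physical core depends on how the OS enumerates hardware threads.)

# Explicit pinning instead of automatic stride detection (values illustrative):
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -ntmpi 1 -ntomp 16 -gpu_id 0 \
          -pin on -pinoffset 0 -pinstride 2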

When it comes to running "wide" ranks (i.e. many OpenMP threads per rank)
on Zen/Ryzen, things are not straightforward, so the default 16/32 threads
on 16 cores + 1 GPU is not great. If already running domain-decomposition,
4-8 threads/rank is generally best, but unfortunately this will often not
be better than just using no DD and taking the hit of threading
inefficiency.
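
(As an illustration of the 4-8 threads/rank suggestion, a sketch with assumed values, not taken from the thread: on the 16-core Threadripper with one GPU, a DD run with 4 ranks of 4 threads would look like this.)

# 4 PP ranks x 4 OpenMP threads = 16 threads on the 16 physical cores,
# with all nonbonded tasks offloaded to GPU 0.
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -nb gpu -gpu_id 0 \
          -pin on -ntmpi 4 -ntomp 4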

A few more comments in-line.

On Wed, Jan 24, 2018 at 10:14 AM, Harry Mark Greenblatt <
harry.greenbl...@weizmann.ac.il> wrote:

> BS”D
>
> In case anybody is interested we have tested Gromacs on a Threadripper
> machine with two GPU’s.
>
> Hardware:
>
> Ryzen Threadripper 1950X 16 core CPU (multithreading on), with Corsair
> H100i V2 Liquid cooling
> Asus Prime X399-A M/B
> 2 X Geforce GTX 1080 GPU’s
> 32 GB of 3200MHz memory
> Samsung 850 Pro 512GB SSD
>
> OS, software:
>
> Centos 7.4, with 4.14 Kernel from ElRepo
> gcc 4.8.5 and gcc 5.5.0
> fftw 3.3.7 (AVX2 enabled)
> Cuda 8
> Gromacs 2016.4
> Gromacs 2018-rc1 and final 2018.
> Using thread-MPI
>
>
> I managed to compile gcc 5.5.0, but when I went to use it to compile
> Gromacs, the compiler could not recognise the hardware, although the native
> gcc 4.8.5 had no problem.
> In 2016.4, I was able to specify which SIMD set to use, so this was not an
> issue.   In any case there was very little difference between gcc 5.5.0 and
> 4.8.5.  So I used 4.8.5 for 2018.
> Any ideas how to overcome this problem with 5.5.0?
>
> 
> Gromacs 2016.4
> 
>
> System: Protein/DNA complex, with 438,397 atoms (including waters/ions),
> 100 ps npt equilibration.
>
> Allowing Gromacs to choose how it wanted to allocate the hardware gave
>
> 8 tMPI ranks, 4 threads per rank, both GPU’s
>
> 12.4 ns/day
>
> When I told it to use 4 tMPI ranks, 8 threads per rank, both GPU’s
>
> 12.2 ns/day
>
>
> Running on “real” cores only
>
> 4 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 10.2 ns/day
>
> 1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on,
> but pinstride and pinoffset automatic)
>
> 10.6 ns/day
>
> 1 tMPI rank, 16 threads per rank, one GPU, and manually set all pinning
> options:
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1
> -gpu_id 0 -pinoffset 0 -pinstride 2
>
> 12.3 ns/day
>
> Presumably, the gain here is because “pinstride 2” caused the job to run
> on the “real” (1,2,3…15) cores, and not on virtual cores.  The automatic
> pinstride above used cores [0,16], [1,17], [2,18]…[7,23], half of which are
> virtual, and so gave only 10.6 ns/day.
>
> ** So there was very little gain from the second GPU, and very little gain
> from multithreading. **
>
> Using AVX_256 and not AVX2_256 with the above command gave a small speed-up
> (although using AVX instead of AVX2 for FFTW made things worse).
>
> 12.5 ns/day
>
>
> To compare with an Intel Xeon Silver system:
> 2 x Xeon Silver 4116 (2.1GHz base clock, 12 cores each, no
> Hyperthreading), 64GB memory
> 2 x Geforce 1080’s (as used in the above tests)
>
> gcc 4.8.5
> Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and
> not by me).
>

AVX2_256 should give some benefit, but not a lot. (BTW, do not use AVX_512
on Silver; even on the Gold / 2-FMA Skylake-X, AVX2 tends to be better when
running with GPUs.)


> 2 MPI ranks, 12 threads each rank, 2 GPU’s
>
> 11.7 ns/day
>
> 4 MPI ranks, 6 threads each rank, 2 GPU’s
>
> 13.0 ns/day
>
> 6 MPI ranks, 4 threads each rank, 2 GPU’s
>
> 14.0 ns/day
>

Similar effect as noted wrt Ryzen.


>
> To compare with the AMD machine, same number of cores
>
> 1 MPI rank, 16 threads, 1 GPU
>
> 11.2 ns/day
>

(Side-note: a bit of an apples-and-oranges comparison, isn't it?)


>
> —
> Gromacs 2018 rc1 (using gcc 4.8.5)
> —
>
> Using AVX_256
>

You should be using AVX2_128 or AVX2_256 on Zen! The former will be fastest
in CPU-only runs, the latter can often be (a bit) faster in GPU-accelerated
runs.


>
> In ‘classic’ mode, not using gpu for PME
>
> 8 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 12.7 ns/day (modest speed up from 12.4 ns/day with 2016.4)
>
> Now use a gpu for PME
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on
>
> used 1 tMPI rank, 32 OpenMP threads, 1 GPU
>
> 14.9 ns/day
>
> Forcing the program to use both GPU’s
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu
>
> 18.5 ns/day
>
> Now with AVX2_128
>
> 19.0 ns/day
>
> Now force Dynamic Load Balancing
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu -dlb yes
>

I would recommend *against* doing that unless you have concrete cases where
this is better than "-dlb auto" -- and if you have such cases, please share
them as it is not expected behavior. (Note: DLB can now detect when turning
it on leads to a performance drop, and it switches itself off automatically
in such cases!)

[gmx-users] Tests with Threadripper and dual gpu setup

2018-01-24 Thread Harry Mark Greenblatt
BS”D

In case anybody is interested we have tested Gromacs on a Threadripper machine 
with two GPU’s.

Hardware:

Ryzen Threadripper 1950X 16 core CPU (multithreading on), with Corsair H100i V2 
Liquid cooling
Asus Prime X399-A M/B
2 X Geforce GTX 1080 GPU’s
32 GB of 3200MHz memory
Samsung 850 Pro 512GB SSD

OS, software:

Centos 7.4, with 4.14 Kernel from ElRepo
gcc 4.8.5 and gcc 5.5.0
fftw 3.3.7 (AVX2 enabled)
Cuda 8
Gromacs 2016.4
Gromacs 2018-rc1 and final 2018.
Using thread-MPI


I managed to compile gcc 5.5.0, but when I went to use it to compile Gromacs, 
the compiler could not recognise the hardware, although the native gcc 4.8.5 
had no problem.
In 2016.4, I was able to specify which SIMD set to use, so this was not an 
issue.   In any case there was very little difference between gcc 5.5.0 and 
4.8.5.  So I used 4.8.5 for 2018.
Any ideas how to overcome this problem with 5.5.0?


Gromacs 2016.4


System: Protein/DNA complex, with 438,397 atoms (including waters/ions), 100 ps 
npt equilibration.

Allowing Gromacs to choose how it wanted to allocate the hardware gave

8 tMPI ranks, 4 threads per rank, both GPU’s

12.4 ns/day

When I told it to use 4 tMPI ranks, 8 threads per rank, both GPU’s

12.2 ns/day


Running on “real” cores only

4 tMPI ranks, 4 threads per rank, 2 GPU’s

10.2 ns/day

1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on, but 
pinstride and pinoffset automatic)

10.6 ns/day

1 tMPI rank, 16 threads per rank, one GPU, and manually set all pinning options:

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1 
-gpu_id 0 -pinoffset 0 -pinstride 2

12.3 ns/day

Presumably, the gain here is because “pinstride 2” caused the job to run on
the “real” (1,2,3…15) cores, and not on virtual cores.  The automatic pinstride
above used cores [0,16], [1,17], [2,18]…[7,23], half of which are virtual, and
so gave only 10.6 ns/day.
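
(For what it is worth, the physical-vs-SMT layout that decides the right pinstride can be read off directly; this check is my addition, not from the thread.)

# Logical CPUs sharing a CORE id are SMT siblings of the same physical core;
# this determines whether -pinstride 1 or 2 hits the "real" cores.
lscpu --extended=CPU,CORE,SOCKET
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list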

** So there was very little gain from the second GPU, and very little gain from
multithreading. **

Using AVX_256 and not AVX2_256 with the above command gave a small speed-up
(although using AVX instead of AVX2 for FFTW made things worse).

12.5 ns/day


To compare with an Intel Xeon Silver system:
2 x Xeon Silver 4116 (2.1GHz base clock, 12 cores each, no Hyperthreading), 
64GB memory
2 x Geforce 1080’s (as used in the above tests)

gcc 4.8.5
Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and not by 
me).


2 MPI ranks, 12 threads each rank, 2 GPU’s

11.7 ns/day

4 MPI ranks, 6 threads each rank, 2 GPU’s

13.0 ns/day

6 MPI ranks, 4 threads each rank, 2 GPU’s

14.0 ns/day

To compare with the AMD machine, same number of cores

1 MPI rank, 16 threads, 1 GPU

11.2 ns/day

—
Gromacs 2018 rc1 (using gcc 4.8.5)
—

Using AVX_256

In ‘classic’ mode, not using gpu for PME

8 tMPI ranks, 4 threads per rank, 2 GPU’s

12.7 ns/day (modest speed up from 12.4 ns/day with 2016.4)

Now use a gpu for PME

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on

used 1 tMPI rank, 32 OpenMP threads, 1 GPU

14.9 ns/day

Forcing the program to use both GPU’s

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 
-gputasks 0011 -nb gpu

18.5 ns/day
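
(My reading of the -gputasks string, an interpretation rather than something stated in the thread: with -ntmpi 4 -npme 1 there are four GPU tasks on the node, the nonbonded task of each of the three PP ranks plus the one PME task, and each digit of "0011" assigns one of them to a device id in the order mdrun enumerates them, so two tasks land on each GTX 1080.)

# One digit per GPU task on the node; "0011" puts two tasks on device 0
# and two on device 1.
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -nb gpu \
          -pin on -ntmpi 4 -npme 1 -gputasks 0011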

Now with AVX2_128

19.0 ns/day

Now force Dynamic Load Balancing

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 
-gputasks 0011 -nb gpu -dlb yes

20.1 ns/day

Now use more (8) tMPI ranks

gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 8 -npme 1 
-gputasks  -nb gpu -dlb yes

20.7 ns/day

And finally, using the final 2018 release (AVX2_128) with the above command line

20.9 ns/day

Here are the final lines from the log file

Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 7.7%.
 The balanceable part of the MD step is 51%, load imbalance is computed from 
this.
 Part of the total run time spent waiting due to load imbalance: 3.9%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
 Average PME mesh/force load: 1.275
 Part of the total run time spent waiting due to PP/PME imbalance: 9.4 %

NOTE: 9.4 % performance was lost because the PME ranks
  had more work to do than the PP ranks.
  You might want to increase the number of PME ranks
  or increase the cut-off and the grid spacing.
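
(Two hedged options for that note, my suggestions rather than anything from the thread: keep mdrun's PME load balancing enabled so it can rescale rcoulomb and the PME grid at run time where the setup allows it, or, in a CPU-PME run, actually add PME ranks; with PME offloaded to a GPU, GROMACS 2018 supports only a single PME rank, so the second option does not apply to this particular run.)

# (1) PME tuning is on by default (-tunepme); it shifts work from the grid
#     to direct space by scaling rcoulomb and the grid spacing together.
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -nb gpu -pin on \
          -ntmpi 4 -npme 1 -gputasks 0011 -dlb yes -tunepme
# (2) With PME on the CPU, the note's "more PME ranks" advice would be e.g.:
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -nb gpu -pme cpu -pin on \
          -ntmpi 8 -npme 2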


 R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 7 MPI ranks doing PP, each using 4 OpenMP threads, and
on 1 MPI rank doing PME, using 4 OpenMP threads

 Computing:          Num   Num      Call    Wall time     Giga-Cycles
                     Ranks Threads  Count      (s)         total sum    %
 -------------------------------------------------------------------------
 Domain decomp.        7     4       500      13.721        1306.196   2.9
 DD comm. load         7     4       500       0.366          34.875   0.1
 DD comm. bounds       7     4