Dieter,

the observation that the 8-core run is slower than the 4-core run is probably not due to CPU hyperthreading, as you suggest. The CPU loads that you report also suggest otherwise. I agree with Mark that it is more likely due to the short time per iteration, i.e. the relatively high amount of overhead compared to the actual calculations. We noticed the same when using FPI. Use MPI or test a slower model, and this effect will probably disappear.
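For reference, the transfer type is selected via the parafile passed to nmfe72. A minimal sketch of the two variants, assuming the example parafiles that ship with NM7.2 (e.g. fpilinux8.pnm / mpilinux8.pnm on Linux; check your installation for the exact names):

```shell
# FPI (file message passing) run on 4 nodes:
nmfe72 run1.ctl run1.res -parafile=fpilinux8.pnm [nodes]=4

# Same run with MPI, which in our tests was more efficient,
# especially when the time per iteration is short:
nmfe72 run1.ctl run1.res -parafile=mpilinux8.pnm [nodes]=4
```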

We also did some benchmarking, and noticed that NM7.2 can do pretty efficient parallelization. Our conclusions:
- MPI is much more efficient than FPI, especially for faster problems.
- The efficiency with MPI seems to hold across estimation methods (FOCE / BAYES / SAEM) and models (8 tested), at around 90% when using 5 cores. See results below.
- Parallelization efficiency depends on, e.g., time per iteration, transfer type, and the number of individuals in the dataset.
- Parallelization (MPI) was still efficient at higher numbers of cores; we tested up to 7 cores on one machine.
- In some basic tests, performance over network nodes seemed as good as when running on a single machine, although fair benchmarking is difficult on a production cluster.

We tested using the gfortran compiler, on a dedicated 8-core machine running Linux.

best regards,
Ron

--
-----------------------------------
Ron Keizer, PharmD PhD
Post-doctoral fellow
Pharmacometrics Research Group
Uppsala University
-----------------------------------



Table 1: multicore efficiency (times in sec; % = time relative to the single-core run)
| tt  | n cores | time_FOCE |   % | time_BAYES |   % |
|-----+---------+-----------+-----+------------+-----|
| -   |       1 |  13462.69 | 100 |    5283.78 | 100 |
| FPI |       2 |   7269.35 |  54 |    3096.51 |  58 |
| FPI |       3 |   5081.05 |  38 |    2470.52 |  46 |
| FPI |       4 |   4211.93 |  31 |    2709.43 |  51 |
| FPI |       5 |   3667.43 |  27 |     2729.8 |  51 |
| FPI |       6 |   3464.34 |  26 |    3254.91 |  61 |
|-----+---------+-----------+-----+------------+-----|
| -   |       1 |  13462.69 | 100 |    5283.78 | 100 |
| MPI |       2 |   7122.48 |  53 |    2731.38 |  51 |
| MPI |       3 |   4826.77 |  36 |    1853.94 |  35 |
| MPI |       4 |   3705.35 |  28 |    1464.69 |  27 |
| MPI |       5 |   2976.36 |  22 |    1179.11 |  22 |
| MPI |       6 |   2519.89 |  19 |    1011.94 |  19 |

Table 2: efficiency across different models (distributed to 5 cores, t in sec; t% = t_mpi5 / t_orig, eff% = t_orig / (5 * t_mpi5))
| mo | model  | est   | n_ind | iter |   t_orig |  t_mpi5 |    t% |  eff% |
|----+--------+-------+-------+------+----------+---------+-------+-------|
| M1 | ADVAN6 | FOCEI |     9 |   16 |   5863.0 | 1881.88 |  32.1 | 62.31 |
| M2 | ADVAN6 | FOCEI |   454 |   28 |   4485.3 |  930.38 | 20.74 | 96.42 |
| M3 | ADVAN6 | FOCEI |   412 |   20 |   363.84 |   78.23 |  21.5 | 93.02 |
| M4 | ADVAN6 | FOCE  |   105 |  486 | 13616.83 | 2979.52 | 21.88 |  91.4 |
| M5 | ADVAN6 | FOCEI |    42 |   45 | 14183.92 | 3167.56 | 22.33 | 89.56 |
| M6 | ADVAN6 | FOCEI |    39 |   43 |  4698.34 |  992.52 | 21.12 | 94.67 |
| M7 | ADVAN6 | FOCE  |   100 |   29 |    33249 | 7493.82 | 22.54 | 88.74 |
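The percentage columns in both tables follow from the reported wall-clock times. A short sketch of the two calculations (the formulas are implied by the columns, not stated explicitly above):

```python
# Parallel timing metrics from wall-clock times.
# t1: single-core time, tn: time on n cores (values taken from the tables).

def rel_time_pct(t1, tn):
    """Time on n cores as a percentage of the single-core time (Table 1 '%')."""
    return 100.0 * tn / t1

def efficiency_pct(t1, tn, n):
    """Parallel efficiency: speedup divided by the number of cores (Table 2 'eff%')."""
    return 100.0 * t1 / (n * tn)

# Table 1, MPI with 5 cores, FOCE:
print(round(rel_time_pct(13462.69, 2976.36)))          # -> 22

# Table 2, model M2 on 5 cores:
print(round(efficiency_pct(4485.3, 930.38, 5), 2))     # -> 96.42
```

Note that the two metrics are reciprocal at a fixed core count: 22% of the original run time on 5 cores corresponds to an efficiency of roughly 100 / (5 * 0.22) ≈ 91%.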


On 5/20/11 9:36 PM, Dieter Menne wrote:
Here are some quick-and-dirty results of my first benchmark with parallel
processing in NONMEM 7.2.

Running Win7, 64-bit, Intel i7, with 4 CPUs (and 4 hyperthreaded cores). One
computer only.

Using file message passing (FPI). Could not get MPI to work in this configuration.

call nmfe72 mtl_KPreM2Pre_T2L2_.ctl -parafile=fpiwini8.pnm [nodes]=(1, 4, or 8)

10 iterations of a very large Bayes problem (which should not profit from
multiple cores, according to the manual)

nodes    time
1        45 s
4        25 s
8        40 s

So about a factor of 2 between 1 and 4 cores.

It is not surprising that 8 gives worse values, because these are not real
CPUs. More surprising is that with 8 "CPUs" I see 100% load on all
of them (huh?), while with 4 CPUs I see the expected 50%.

Dieter






