Dear Julia,

I am very happy to see that Merck is leveraging parallelized computing so
extensively. We recently did some systematic testing on parallelized modeling
and published the results:
AAPS Journal 2011, DOI: http://dx.doi.org/10.1208/s12248-011-9258-9
http://www.springerlink.com/content/c215433172002281/
The results on the efficiency of parallelization were generally in good
agreement with the testing that Chee and Bob presented for parallelized
S-ADAPT earlier.

This study used the importance sampling EM algorithm (pmethod=4 in S-ADAPT;
equivalent to method IMPMAP in NONMEM). For this example, parallelizing on
8 threads yielded a 6.9-fold faster estimation, and parallelizing on 48
threads yielded a 23-fold faster estimation. As the datasets had 48 subjects,
each thread received one subject in the latter case, and about 50% of the
estimation time was spent distributing data over the network.

The benefit of parallelization increases significantly:
1) If the algorithm has a large (>99%) parallelizable fraction that can be
distributed among worker nodes. (IMPMAP is very well suited for this; MCMC
is not, and FOCE should have a smaller parallelizable fraction than IMPMAP.)
Example: by Amdahl's law, a program with a 50% parallelizable fraction can be
accelerated to at most 2-fold the single-threaded speed, no matter how many
cores one has.

2) If the dataset has many subjects. This is most critical for industry.

3) If the model is complex and requires differential equations. (Parallelizing
a one-compartment model is unlikely to yield much benefit, due to network
traffic.)

4) Bootstrap analyses are ideal for distribution across the network. Each
bootstrap run is best kept single-threaded, though, as one can parallelize
with 100% efficiency across the 1000 bootstrap replicates.
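As a quick sanity check on point 1: the measured speedups above are consistent with Amdahl's law if one assumes a parallelizable fraction of roughly 97.7% (that fraction is my back-of-the-envelope inference from the 8-thread result, not a published figure):

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n threads for parallelizable fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.977  # parallel fraction inferred so that 8 threads give ~6.9-fold
print(round(amdahl_speedup(p, 8), 1))       # ~6.9-fold, as in the 8-thread run
print(round(amdahl_speedup(p, 48), 1))      # ~23-fold, as in the 48-thread run
print(round(amdahl_speedup(0.5, 10**6), 1)) # 50% parallel fraction caps near 2-fold
```

Note how the same fraction reproduces both the 8-thread and the 48-thread results, and how a 50% fraction never exceeds 2-fold regardless of core count.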
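The bootstrap pattern in point 4 can be sketched as an embarrassingly parallel map over replicates. This is only an illustrative toy (run_replicate is a hypothetical stand-in for one single-threaded model fit, not S-ADAPT or NONMEM code):

```python
from multiprocessing import Pool
import random

def run_replicate(seed):
    """Stand-in for one single-threaded bootstrap fit. A real run would
    resample subjects with replacement and re-estimate the model."""
    rng = random.Random(seed)
    data = [rng.gauss(10.0, 2.0) for _ in range(48)]  # 48 resampled "subjects"
    return sum(data) / len(data)  # stand-in parameter estimate

if __name__ == "__main__":
    # Replicates are independent, so 1000 of them parallelize with
    # near-100% efficiency across however many workers are available.
    with Pool(processes=8) as pool:
        estimates = pool.map(run_replicate, range(1000))
    print(len(estimates))  # 1000 replicate estimates
```

Because each replicate shares nothing with the others, the only serial parts are generating the resampled datasets and summarizing the 1000 estimates at the end.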

Some additional thoughts:
a) The larger your cluster, the more important it is to invest in the model
code and dataset debugger before a model is compiled, since you do not want
to manually shut down 2000 simultaneously running exe files that might not
have closed properly. This is one of the key reasons why we invested
significant time in developing a free pre-processor for S-ADAPT.

b) If you have 2000 nodes, it may be worth considering launching jobs from
several master nodes. You can run into trouble with both the available RAM
and the network traffic if everything needs to funnel through one master
node, even if you use 4x InfiniBand networking, for example.

c) Creating a queuing system to prioritize jobs from different users and
projects may help. Your computational chemistry group must have a system
like this.

d) Saving and analyzing intermediate results is most critical for large
parallelized jobs.

Hope this provides some useful ideas. Overall, I think that for (complex)
models requiring differential equations, parallelization will decide whether
a project is feasible in the available time. This is why we almost always
parallelize.

Best wishes
Juergen



Jürgen B. Bulitta, Ph.D., Senior Scientist,
Ordway Research Institute,
150 New Scotland Avenue, Albany, NY 12208, USA
Phone: +1 (518) 641-6418, Fax: +1 (518) 641-6304
Email: [email protected]
http://www.ordwayresearch.org/profile_bulitta.html


From: [email protected] [mailto:[email protected]] On 
Behalf Of Ivashina, Julia
Sent: Friday, March 25, 2011 11:43 AM
To: [email protected]
Subject: [NMusers] NONMEM/PsN benchmark for SGE expansion

Dear all,

We would like to benchmark our new SGE cluster, and would appreciate hearing
from anyone who has performed a similar task and can share their findings.

We use NONMEM 7.1.2 with PsN 3.2.12 in two cluster environments.
Our older environment consists of 9 quad-core machines (about 40 work nodes,
counting the head node), and the newer one has over 2000 work nodes with
512 CPUs each.

These are the questions we'd like to answer:
- What is a reasonable time one should expect to shave off by moving PK/PD
  analysis from the smaller cluster to the bigger one?
- What type of analysis is the most sensitive to an increase in the number
  of work nodes?
- What should be the expected gain from increasing the number in -threads
  50 times?
- What parts of NONMEM/PsN are the most optimized for parallel execution?
- What are the scenarios where the gain from parallelization is biggest?

The initial bootstrap test we ran showed some progress, although the model we
chose did not run 50 times faster (2000/40 = 50). Some of the reasons:
pre-processing (creation of the bootstrap samples), the Fortran compiler's
work, and the combining of results are not spread across work nodes. Since
the compute time for each job was small (5-10 seconds), the overhead of job
submission was relatively significant.
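The shortfall described here can be sketched with a crude wall-clock cost model. The numbers below are purely illustrative assumptions (the per-job submission overhead and serial pre/post-processing time are hypothetical, not measured values from either cluster):

```python
def bootstrap_wall_time(n_jobs, n_nodes, job_time, per_job_overhead, serial_time):
    """Crude model: serial pre/post-processing plus parallel waves of
    jobs, where each wave pays the per-job submission overhead."""
    waves = -(-n_jobs // n_nodes)  # ceiling division: waves of concurrent jobs
    return serial_time + waves * (job_time + per_job_overhead)

# Illustrative: 1000 replicates of a 7.5 s job, 60 s of serial work,
# and an assumed 2 s submission overhead per job (hypothetical figures).
small = bootstrap_wall_time(1000, 40, 7.5, 2.0, 60)    # 40-node cluster
big = bootstrap_wall_time(1000, 2000, 7.5, 2.0, 60)    # 2000-node cluster
print(round(small / big, 1))  # far below the naive 2000/40 = 50-fold expectation
```

With jobs this short, the serial fraction and the per-job overhead dominate, so the observed speedup saturates well below the node-count ratio; longer-running models dilute both terms.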

We also use the vpc, npc, cdd, llp, sse, and scm analyses, so we would like
to get some ideas on the parallelization capability of these functions. Any
benchmarking results or ideas that you can share are very much appreciated.

Thank you,
Julia






Notice: This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates (direct contact information
for affiliates is available at http://www.merck.com/contact/contacts.html)
that may be confidential, proprietary, copyrighted and/or legally
privileged. It is intended solely for the use of the individual or entity
named on this message. If you are not the intended recipient, and have
received this message in error, please notify us immediately by reply
e-mail and then delete it from your system.
