Hey Chee,

It's good to hear that the conversation didn't fall off the radar, and thank you for your efforts on the GPU front. It looks like we have been working in parallel (no pun intended) to make NONMEM GPU-friendly. On the GPU side here, I am not familiar with the FX3800M; on our end I have an NVIDIA FX5800 and a Tesla 2060 Compute Processor. I still have to get concrete run-time values, but the Tesla 2060 is capable of double-precision math, whereas the FX series is not (a GPU for games vs. a GPU for computation). Not all GPUs are the same, as you can tell. On your point about PGI Fortran, you are correct that it doesn't make NONMEM run on the GPU, but I am at the first steps here; once I see more of NONMEM's back-end code I can look for opportunities to port it fully. For now I just wanted to see where the bottleneck is. Another minor improvement is disk-cache access time: if you run NONMEM with your CTL stream and data file on an SSD, you get a slight performance benefit.
Thank you,
Amr

From: owner-nmus...@globomaxnm.com [mailto:owner-nmus...@globomaxnm.com] On Behalf Of chee ng
Sent: Monday, March 21, 2011 4:45 PM
To: nmusers@globomaxnm.com
Subject: Re: [NMusers] GPU port for NONMEM

Hi Mark, Xavier and all,

Thank you for sharing the information. Someone pointed me to this interesting discussion about GPUs and NLME in the NMusers group. Recently, I developed a prototype of a GPU-based QRPEM (Quasi-Random Parametric EM) algorithm on a single laptop equipped with an Intel Core i7-920 (2.6 GHz) Extreme quad-core processor and an NVIDIA Quadro FX3800M graphics card containing 128 stream processors (hopefully PAGE will accept my poster for presentation in Greece this year). Using a simple one-compartment PK model, my results were very similar to those obtained by Xavier with SAEM: the GPU computation (based on a single graphics card) gave close to a 20x speed increase over the CPU (in this case the Intel Core i7-920 Extreme, one of the fastest laptop CPUs). In addition, the GPU shows a much better scaling relationship between computation time and the number of random samples (Nmc) used to compute the E-step of the QRPEM algorithm. Increasing Nmc from 1,000 to 20,000 raised the mean computation time (over 30 iterations) from 2.9 to 38 min for the CPU-based QRPEM, but only from 0.5 to 1.9 min for the GPU-based QRPEM. I suspect that GPU computing will become even more attractive for more complex population PK/PD models, and I am currently developing an ODE solver for the GPU so I can test this hypothesis. Because of the different programming logic and the limited memory that has to be shared by many stream processors (up to 448 processor cores with 6 GB of memory on a single Tesla 2070 card), GPUs demand smart, efficient programming and upfront thinking about the numerical algorithms being implemented (think of the comparison between a hybrid car and a V-8 GM Hummer).
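The Nmc scaling Chee describes follows from the structure of the Monte Carlo E-step: each of the Nmc parameter samples is evaluated independently of the others. A minimal NumPy sketch of that structure (not Chee's QRPEM code; the model, dose, and parameter names are illustrative assumptions, and an IV-bolus one-compartment model stands in for "a simple one-compartment PK model"):

```python
import numpy as np

def estep_posterior_mean(t, y, mu, omega, sigma, nmc, rng):
    """Monte Carlo E-step for one individual under a one-compartment
    IV-bolus model C(t) = (dose/V) * exp(-k*t).
    Each of the nmc samples is evaluated independently of the others --
    this is the embarrassingly parallel loop that maps one sample
    (or one individual) per GPU thread."""
    dose = 100.0  # illustrative fixed dose
    # Sample individual parameters eta = (log k, log V) around the
    # population means mu with covariance omega.
    eta = rng.multivariate_normal(mu, omega, size=nmc)      # (nmc, 2)
    k = np.exp(eta[:, 0])[:, None]                          # (nmc, 1)
    v = np.exp(eta[:, 1])[:, None]                          # (nmc, 1)
    pred = (dose / v) * np.exp(-k * t[None, :])             # (nmc, n_obs)
    # Log-likelihood of the observations under additive error sigma.
    resid = (y[None, :] - pred) / sigma
    logw = -0.5 * np.sum(resid**2, axis=1)
    # Normalized importance weights (shifted for numerical stability).
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # Importance-weighted posterior mean of the individual parameters.
    return w @ eta
```

Every row of `eta` is processed with no dependence on the others, which is what lets a GPU assign one sample per thread; QRPEM, as its name suggests, additionally replaces the pseudo-random draws with quasi-random (low-discrepancy) sequences.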
Based on my current understanding, I agree with Xavier that the estimation core of NONMEM (and also S-ADAPT) would probably need to be rewritten almost from scratch. Simply compiling the existing NONMEM code with PGI Fortran will not make NONMEM run in GPU mode; you have to change the source code in order to take advantage of the slim but efficient GPU computing model.

Kind Regards,
Chee M Ng, PharmD, PhD, FCP
Children's Hospital of Philadelphia
School of Medicine, University of Pennsylvania
Philadelphia, PA 19104

-----
From: Mark Sale - Next Level Solutions <m...@nextlevelsolns.com>
Cc: nmusers@globomaxnm.com
Sent: Thu, March 17, 2011 8:36:57 AM
Subject: RE: [NMusers] GPU port for NONMEM

Xavier,

That is very exciting. But the question was, I think, about running NONMEM on a GPU. My conclusion was that there isn't a clear way to break the NONMEM algorithm up into pieces small enough for GPU computing, not whether a new application designed to run on a GPU could be written. Your point about the limited memory is a good one: a copy of the entire memory space does not need to be loaded for each core, only the part specific to that core, which might be quite small (or quite large; in the case of NONMEM, at least, some of the arrays are very large, although this is dramatically better with the recent dynamic sizing, something that wasn't available when we looked at GPU computing). I'll look forward to hearing more about your work; it would be a very important result.
Mark Sale MD
President, Next Level Solutions, LLC
www.NextLevelSolns.com
919-846-9185
A carbon-neutral company
See our real time solar energy production at: http://enlighten.enphaseenergy.com/public/systems/aSDz2458

-------- Original Message --------
Subject: Re: [NMusers] GPU port for NONMEM
From: Xavier Woot de Trixhe <xavier.wootdetri...@exprimo.com>
Date: Thu, March 17, 2011 7:52 am
To: nmusers@globomaxnm.com

Hi,

I have to disagree with your conclusions. Last year a proof-of-concept (POC) program was implemented in CUDA to check the concept rather than speculate about the subject. This POC used the SAEM algorithm and a fairly simple PK model: single dose and an analytical function. From this exercise it became clear that:
- The data is a lot less dissimilar than you imply. All the data fits in a 2D CSV file, which can hardly be qualified as a complex data structure, and every individual has the same number of parameters.
- The computation of IPREDs for analytical models is almost trivial, and for ODEs one can start with simpler, less efficient (fixed-step) ODE solvers and still expect an improvement.
- The limited "memory per core" is the memory shared by threads; it is used for memory optimization, not to hold the whole problem. The main memory is up to 4 GB, which far exceeds the "1-2 GB per NONMEM run" rule of thumb.

Our POC software achieved close to a 30x speed increase over our reference (CPU: Core i7 940 @ 2.93 GHz | GPU: NVIDIA GTX 285, about $500) without even using within-individual parallelisation, which could in theory yield a total 240x speed-up. Where we agree is that NONMEM would probably need to be rewritten almost from scratch. So although it could (and should) be done, from a programmer's point of view the resulting software would no longer be NONMEM.
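Xavier's suggestion of starting from a fixed-step ODE solver can be sketched in a few lines. This NumPy version (illustrative only, not the CUDA POC; the two-compartment absorption model and names are assumptions) advances every individual's state in lockstep with classic RK4, so each individual is an independent lane that would map one-per-thread on a GPU:

```python
import numpy as np

def rk4_depletion(ka, ke, a_gut0, t_end, dt):
    """Fixed-step RK4 integration of a first-order absorption model:
        dA_gut/dt  = -ka * A_gut
        dA_cent/dt =  ka * A_gut - ke * A_cent
    ka, ke, a_gut0 are arrays over individuals; every individual is
    advanced in lockstep with no data dependence between them --
    exactly the structure that maps one individual per GPU thread."""
    n_steps = int(round(t_end / dt))
    state = np.stack([a_gut0, np.zeros_like(a_gut0)])  # (2, n_ind)

    def deriv(s):
        gut, cent = s
        return np.stack([-ka * gut, ka * gut - ke * cent])

    for _ in range(n_steps):
        k1 = deriv(state)
        k2 = deriv(state + 0.5 * dt * k1)
        k3 = deriv(state + 0.5 * dt * k2)
        k4 = deriv(state + dt * k3)
        state = state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state[1]  # central-compartment amounts, shape (n_ind,)
```

A fixed step size wastes some work compared to an adaptive solver, but it keeps every thread on the same instruction at the same time, which is what GPU hardware rewards; adaptive step control reintroduces per-individual branching.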
Kind regards,
Xavier

On 03/16/2011 03:00 AM, Mark Sale - Next Level Solutions wrote:

GPU was considered in the work leading up to the upcoming parallel NONMEM. GPU computing is intended essentially for very simple algorithms applied to a large number of very similar data sets (things like bit-shifting everything left, or adding a number to every value of a matrix). It is really not well suited to the NONMEM algorithm (at least the time-consuming part, which is the calculation of predictions). Most GPUs don't have nearly enough memory per core to run NONMEM in the way that it will be parallelized (which basically runs the entire PRED, just on a subset of the data). We tried really hard to think of a way to do NONMEM with a GPU, even consulted with a GPU computing expert at the University of North Carolina, and couldn't come up with a way to do it; and even if the memory requirement could be addressed, it would be a significant rewrite of the NONMEM code. But mainly it wasn't clear to us how the algorithm could be broken up into pieces small enough for GPU computing. But never say never.

Mark Sale MD
President, Next Level Solutions, LLC
www.NextLevelSolns.com
919-846-9185

-------- Original Message --------
Subject: [NMusers] GPU port for NONMEM
From: "Amr Ragab" <amrra...@gmail.com>
Date: Tue, March 15, 2011 9:01 pm
To: nmusers@globomaxnm.com

Hello NMUsers,

I wanted to ask what has been done in terms of utilizing the various GPU architectures available. Currently I have access to a few NVIDIA Tesla cards (with double-precision math). I know PGI has worked out a solution using NVIDIA's CUDA toolkit to create a C++/Fortran compiler that targets the GPU.
Unfortunately the NONMEM SETUP7 script doesn't include an install option for PGI's Fortran. Through some Windows trickery I managed to make NONMEM think it is invoking the Intel Fortran compiler when it actually opens CUDA Fortran, but that's not a permanent solution. That would of course only be a first step; the ideal solution would be for the entire NONMEM program to be optimized for the GPU. And it looks like there is a performance benefit for large datasets or for calling nmfe7 for multiple runs.

Thanks,
Amr