On Fri, Dec 02, 2011 at 11:54:20AM +0100, Fredrik Heffer Valdmanis wrote:

2011/10/28 Matthew Knepley <knepley at gmail.com>:
>> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis
>> <fredva at ifi.uio.no> wrote:
>>
>> Hi,
>>
>> I am working on integrating the new GPU-based vectors and matrices into
>> FEniCS. Now I am looking at the possibility of getting some speedup
>> during finite element assembly, specifically when inserting the local
>> element matrix into the global matrix. In that regard, I have a few
>> questions I hope you can help me out with:
>>
>> - When calling MatSetValues with a MATSEQAIJCUSP matrix as parameter,
>>   what exactly happens? As far as I can see, MatSetValues is not
>>   implemented for GPU-based matrices, nor is mat->ops->setvalues set to
>>   point at any function for this Mat type.
>
> Yes, MatSetValues always operates on the CPU side. It would not make
> sense to do individual operations on the GPU.
>
> I have written batched assembly for element matrices that are all the
> same size:
>
> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
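The batched interface linked above takes every (same-size) element matrix in one call. As a rough illustration of the data layout this implies — not PETSc code, just a plain-Python sketch with hypothetical element data — the per-element dof indices and row-major local matrices are concatenated into two flat arrays:

```python
# Plain-Python sketch (hypothetical data, not the PETSc API): the flat layout
# implied by batched assembly of Ne element matrices that are all Nl x Nl --
# one index array of length Ne*Nl and one values array of length Ne*Nl*Nl.

def flatten_batch(element_dofs, element_mats):
    """element_dofs: Ne lists of Nl global dof indices.
    element_mats:  Ne row-major Nl*Nl local matrices (flat lists).
    Returns the concatenated index and value arrays."""
    elem_rows, elem_vals = [], []
    for dofs, mat in zip(element_dofs, element_mats):
        elem_rows.extend(dofs)  # Nl indices per element (rows == cols)
        elem_vals.extend(mat)   # Nl*Nl values per element, row-major
    return elem_rows, elem_vals

# Two hypothetical P1 triangles sharing an edge (Nl = 3):
rows, vals = flatten_batch([[0, 1, 2], [1, 2, 3]],
                           [[1.0] * 9, [2.0] * 9])
assert rows == [0, 1, 2, 1, 2, 3]
assert len(vals) == 2 * 9
```

In the actual call, the two flat arrays together with the element count and block size would be handed over in a single invocation, rather than one MatSetValues call per element.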
>> - Is it such that matrices are assembled in their entirety on the CPU
>>   and then copied over to the GPU (after calling MatAssemblyBegin)? Or
>>   are values copied over to the GPU each time you call MatSetValues?
>
> That function assembles the matrix on the GPU and then copies it to the
> CPU. The only time you do not want this copy is when you are running in
> serial and never touch the matrix afterwards, so I left it in.
>
>> - Can we expect to see any speedup from using MatSetValuesBatch over
>>   MatSetValues, or is the batch version simply a utility function? This
>>   question goes for both CPU- and GPU-based matrices.
>
> CPU: no.
>
> GPU: yes, I see about the memory bandwidth ratio.

On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis
<fredva at ifi.uio.no> wrote:
> Hi,
>
> I have now integrated MatSetValuesBatch in our existing PETSc wrapper
> layer. I have tested matrix assembly with Poisson's equation on different
> meshes with elements of varying order. I have timed the single call to
> MatSetValuesBatch and compared that to the total time consumed by the
> repeated calls to MatSetValues in the old implementation. I have the
> following results:
>
> Poisson on 1000x1000 unit square, 1st-order Lagrange elements:
>   MatSetValuesBatch:              0.88576 s
>   repeated calls to MatSetValues: 0.76654 s
>
> Poisson on 500x500 unit square, 2nd-order Lagrange elements:
>   MatSetValuesBatch:              0.9324 s
>   repeated calls to MatSetValues: 0.81644 s
>
> Poisson on 300x300 unit square, 3rd-order Lagrange elements:
>   MatSetValuesBatch:              0.93988 s
>   repeated calls to MatSetValues: 1.03884 s
>
> As you can see, the two methods take almost the same amount of time.
> What behavior and performance should we expect? Is there any way to
> optimize the performance of batched assembly?

2011/11/29 Matthew Knepley <knepley at gmail.com>:
> Almost certainly it is not dispatching to the CUDA version. The regular
> version just calls MatSetValues() in a loop. Are you using a SEQAIJCUSP
> matrix?

On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis
<fredva at ifi.uio.no> wrote:
> Yes. The same matrices yield a speedup of 4-6x when solving the system on
> the GPU.

2011/11/29 Matthew Knepley <knepley at gmail.com>:
> Please confirm that the correct routine is used by running with -info and
> sending the output.
>
> Please send the output of -log_summary so I can confirm the results.
>
> You can run KSP ex4 and reproduce my results, where I see a 5.5x speedup
> on the GTX 285.

On Thu, Dec 1, 2011 at 5:39 AM, Fredrik Heffer Valdmanis
<fredva at ifi.uio.no> wrote:
> I am not sure what to look for in those outputs. I have uploaded the
> output of running my assembly program with -info and -log_summary, and
> the output of running ex4 with -log_summary. See
>
> http://folk.uio.no/fredva/assembly_info.txt
> http://folk.uio.no/fredva/assembly_log_summary.txt
> http://folk.uio.no/fredva/ex4_log_summary.txt
>
> Trying this on a different machine now, I actually see some speedup. 3rd
> order Poisson on 300x300 assembles in 0.211 sec on the GPU and 0.4232 sec
> on the CPU. For 1st order and a 1000x1000 mesh, I go from 0.31 sec to
> 0.205 sec. I have tried to increase the mesh size to see if the speedup
> increases, but I hit the bad_alloc error pretty quickly.
>
> For a problem of that size, should I expect even more speedup? Please let
> me know if you need any more output from test runs on my machine.

2011/12/1 Matthew Knepley <knepley at gmail.com>:
> Here are my results for n x n grids where n = range(150, 1350, 100). This
> is using a GTX 285. What card are you using?

2011/12/2 Fredrik Heffer Valdmanis <fredva at ifi.uio.no>:
> I realize now that I was including the time it takes to construct the
> large flattened array of values that is sent to MatSetValuesBatch. I
> assume, of course, that you only time MatSetValues/MatSetValuesBatch
> completely isolated. If I do this, I get significant speedup as well.
> Sorry for the confusion here.
> Still, this construction has to be done somehow in order to have
> meaningful data to pass to MatSetValuesBatch. The way I do this is
> apparently almost as costly as calling MatSetValues for each local
> matrix.
>
> Have you got any ideas on how to speed up the construction of the values
> array? This has to be done very efficiently in order for batch assembly
> to yield any speedup overall.

Arg, disregard the last transmission! I was confusing myself with timings
from several runs, and the "significant speedup" I referred to was seen
when I timed things very badly. The numbers from yesterday's mail are
correct; those were obtained using a GTX 280. That is, 30-50% speedup on
Poisson 2D on different meshes.

The question from my previous email remains, though: we need to speed up
the construction of the values array to get good speedup overall.

Sorry for the spamming,
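On speeding up construction of the values array: one common pattern, shown here as a plain-Python sketch with hypothetical names (in the FEniCS/PETSc setting the same idea would apply to a contiguous C or NumPy buffer), is to preallocate the full buffer once and have each element kernel write straight into its slice, so no per-element temporaries are concatenated:

```python
# Plain-Python sketch (hypothetical names): preallocate the batch values
# array once and fill it in place, instead of building per-element lists
# and concatenating them afterwards.

def assemble_values(ne, nl, element_kernel):
    """element_kernel(e, out) must fill out[0:nl*nl] with the row-major
    local matrix of element e."""
    block = nl * nl
    vals = [0.0] * (ne * block)      # single allocation up front
    scratch = [0.0] * block
    for e in range(ne):
        element_kernel(e, scratch)
        vals[e * block:(e + 1) * block] = scratch  # contiguous write-back
    return vals

def demo_kernel(e, out):
    # Hypothetical stand-in for the real element integration: element e's
    # 3x3 local matrix is constant (e + 1) in every entry.
    for i in range(len(out)):
        out[i] = float(e + 1)

vals = assemble_values(2, 3, demo_kernel)
assert vals == [1.0] * 9 + [2.0] * 9
```

The point of the design is that the only per-element work is the element integration itself plus one contiguous copy; repeated allocation, list growth, and concatenation are what make naive construction nearly as costly as per-element insertion.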
Off-topic: I find this thread extremely hard to follow. Is Gmail required to read this list? The HTML formatting with indentation (and no ">") makes it really hard to read in my email client (mutt). -- Anders
