On Tue, Nov 12, 2013 at 2:48 PM, Roc Wang <[email protected]> wrote:
> On Tue, Nov 12, 2013 at 2:22 PM, Matthew Knepley <[email protected]> wrote:
>
> > On Tue, Nov 12, 2013 at 2:14 PM, Roc Wang <[email protected]> wrote:
> >
> > > Thanks Jed, I have questions about load balance and PC type below.
> > >
> > > On Sun, Nov 10, 2013 at 12:20 PM, Jed Brown <[email protected]> wrote:
> > >
> > > > Roc Wang <[email protected]> writes:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am trying to minimize the computing time needed to solve a large
> > > > > sparse linear system. The matrix is built on a grid with m=321,
> > > > > n=321, and p=321. I am trying to reduce the computing time from two
> > > > > directions: (1) finding a preconditioner that reduces the number of
> > > > > iterations, and (2) requesting more cores.
> > > > >
> > > > > ----For the first method, I tried several option sets:
> > > > > 1. default KSP and PC
> > > > > 2. -ksp_type fgmres -ksp_gmres_restart 30 -pc_type ksp -ksp_pc_type jacobi
> > > > > 3. -ksp_type lgmres -ksp_gmres_restart 40 -ksp_lgmres_augment 10
> > > > > 4. -ksp_type lgmres -ksp_gmres_restart 50 -ksp_lgmres_augment 10
> > > > > 5. -ksp_type lgmres -ksp_gmres_restart 40 -ksp_lgmres_augment 10 -pc_type asm (PCASM)
> > > > >
> > > > > The iteration counts and timings with 128 cores requested:
> > > > >
> > > > >   case#   iter   timing (s)
> > > > >   1       1436     816
> > > > >   2          3   12658
> > > > >   3       1069     669.64
> > > > >   4        872     768.12
> > > > >   5        927     513.14
> > > > >
> > > > > It can be seen that changing -ksp_gmres_restart and
> > > > > -ksp_lgmres_augment helps reduce the iterations but not the timing
> > > > > (comparing cases 3 and 4). Second, PCASM helps a lot. Although
> > > > > option set 2 reduces the iterations dramatically, the timing
> > > > > increases very much. Is that because more operations are needed
> > > > > inside the PC?
> > > > >
> > > > > My questions here are: 1. In which direction should I adjust
> > > > > -ksp_gmres_restart and -ksp_lgmres_augment? For example, is a
> > > > > larger restart with a larger augment better, or a larger restart
> > > > > with a smaller augment?
> > > >
> > > > Look at the -log_summary. By increasing the restart, the work in
> > > > KSPGMRESOrthog will increase linearly, but the number of iterations
> > > > might decrease enough to compensate. There is no general rule here
> > > > since it depends on the relative expense of operations for your
> > > > problem on your machine.
> > > >
> > > > > ----For the second method, I ran with -ksp_type lgmres
> > > > > -ksp_gmres_restart 40 -ksp_lgmres_augment 10 -pc_type asm on
> > > > > different numbers of cores. I found the speedup ratio increases
> > > > > slowly when more than 32 to 64 cores are requested. I searched the
> > > > > mailing list archives and found that I am very likely running into
> > > > > the memory bandwidth bottleneck:
> > > > > http://www.mail-archive.com/[email protected]/msg19152.html
> > > > >
> > > > >   # of cores   iter   timing (s)
> > > > >     1          923    19541.83
> > > > >     4          929     5897.06
> > > > >     8          932     4854.72
> > > > >    16          924     1494.33
> > > > >    32          924     1480.88
> > > > >    64          928      686.89
> > > > >   128          927      627.33
> > > > >   256          926      552.93
> > > >
> > > > The bandwidth issue has more to do with using multiple cores within
> > > > a node rather than between nodes. Likely the above is a load
> > > > balancing problem or bad communication.
> > >
> > > I use a DM to manage the distributed data. The DM was created by
> > > calling DMDACreate3d() and letting PETSc decide the local number of
> > > nodes in each direction. To my understanding, the load on each core is
> > > determined at this stage. Is the load balanced when DMDACreate3d() is
> > > called with the PETSC_DECIDE option? Or how should I balance the load
> > > after the DM is created?
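
A note on the question above: the ownership split is fixed when the DMDA is
created. The sketch below (not the poster's actual code; written against the
PETSc 3.4-era API current when this thread ran, where newer releases spell
the boundary types DM_BOUNDARY_NONE and require DMSetUp() after creation)
shows where PETSC_DECIDE enters and how to inspect or override the
decomposition:

    #include <petscdmda.h>

    int main(int argc, char **argv)
    {
      DM             da;
      PetscInt       m, n, p;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

      /* 321x321x321 grid, 1 dof, stencil width 1.  PETSC_DECIDE lets
         PETSc choose the process grid m x n x p; each direction is then
         split as evenly as integer division allows, e.g. 321 points over
         8 ranks gives one rank 41 planes and seven ranks 40 (321 = 8*40+1). */
      ierr = DMDACreate3d(PETSC_COMM_WORLD,
                          DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_NONE,
                          DMDA_BOUNDARY_NONE, DMDA_STENCIL_STAR,
                          321, 321, 321,
                          PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE,
                          1, 1, NULL, NULL, NULL, &da);CHKERRQ(ierr);

      /* Report the process grid PETSc chose.  To control the split
         yourself, pass explicit m, n, p and lx[m], ly[n], lz[p] arrays
         (each summing to 321) instead of PETSC_DECIDE and NULL; there is
         no way to rebalance the DMDA after creation. */
      ierr = DMDAGetInfo(da, NULL, NULL, NULL, NULL, &m, &n, &p,
                         NULL, NULL, NULL, NULL, NULL, NULL);CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "process grid %D x %D x %D\n",
                         m, n, p);CHKERRQ(ierr);

      ierr = DMDestroy(&da);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }

Since 321 = 3 x 107 has few divisors, most process counts leave some ranks
with one extra plane per direction, which is consistent with Matt's
suggestion below to test with an evenly divisible cube.
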
> > We do not have a way to do fine-grained load balancing for the DMDA,
> > since it is intended for very simple topologies. You can see whether it
> > is load imbalance from the division by running with a cube that is
> > evenly divisible by a cube number of processes.
> >
> >    Matt
>
> So, there is nothing I can do to balance the load if I use a DMDA? Would
> you please take a look at the attached log summary files and give me some
> suggestions on how to improve the speedup ratio? Thanks.

Please try what I suggested above. And it looks like there is a little load
imbalance (the 3.4 and 3.6 after the times are max/min time ratios across
processes):

  VecAXPY 234 1.0 1.0124e+00 3.4 1.26e+08 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 15290
  VecAXPY 234 1.0 4.2862e-01 3.6 6.37e+07 1.1 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 36115

although it is not limiting the speedup. The time imbalance is really
strange. I am guessing other jobs are running on this machine.

   Matt

> > > > > My question here is: is there any other PC that can help with both
> > > > > reducing the iterations and increasing the scalability? Thanks.
> > > >
> > > > Always send -log_summary with questions like this, but algebraic
> > > > multigrid is a good place to start.
> > >
> > > Please take a look at the attached log files; they are for 128 cores
> > > and 256 cores, respectively. Based on the log files, what should be
> > > done to increase the scalability? Thanks.

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
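
All the option sets tried above, as well as the algebraic multigrid Jed
suggests (-pc_type gamg), can be selected at run time without code changes
as long as the solver calls KSPSetFromOptions(). A minimal, self-contained
sketch, again in the PETSc 3.4-era API (newer releases use the
three-argument KSPSetOperators() and MatCreateVecs()), with a 1-D Laplacian
standing in for the poster's matrix:

    #include <petscksp.h>

    int main(int argc, char **argv)
    {
      Mat            A;
      Vec            x, b;
      KSP            ksp;
      PetscInt       i, rstart, rend, N = 1000;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

      /* Distributed tridiagonal (1-D Laplacian) placeholder matrix. */
      ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                          N, N, 3, NULL, 1, NULL, &A);CHKERRQ(ierr);
      ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
      for (i = rstart; i < rend; i++) {
        if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
        if (i < N-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
        ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

      ierr = MatGetVecs(A, &x, &b);CHKERRQ(ierr);
      ierr = VecSet(b, 1.0);CHKERRQ(ierr);

      ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
      /* Picks up -ksp_type, -ksp_gmres_restart, -ksp_lgmres_augment,
         -pc_type, ... from the command line. */
      ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
      ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

      ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      ierr = VecDestroy(&b);CHKERRQ(ierr);
      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return 0;
    }

For example, case 5 above with profiling becomes:

    mpiexec -n 128 ./solve -ksp_type lgmres -ksp_gmres_restart 40 \
        -ksp_lgmres_augment 10 -pc_type asm -log_summary
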
