   I would next parallelize the function evaluation, since it is now the single largest consumer of time and should presumably be faster in parallel; a rough sketch of what that might look like is included below. After that, revisit the -log_summary output to decide whether the Jacobian evaluation can be improved further (see the second sketch, after the quoted thread).
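Something along these lines is what I mean. This is only a rough sketch, not your code: the AppCtx fields, the VecScatterCreateToAll() full local copy of X, and the placeholder residual formula are all assumptions about your setup. The point is that each process fills only the rows of F it owns:

#include <petscts.h>

/* Hypothetical application context; your code has its own. */
typedef struct {
  PetscInt   n;        /* 4*n global unknowns */
  VecScatter tolocal;  /* scatter from the distributed X to a full local copy,
                          created once with VecScatterCreateToAll(X,&tolocal,&Xloc) */
  Vec        Xloc;     /* sequential work vector holding all of X */
} AppCtx;

PetscErrorCode IFunctionParallel(TS ts,PetscReal t,Vec X,Vec Xdot,Vec F,void *ctx)
{
  AppCtx            *user = (AppCtx*)ctx;
  const PetscScalar *x,*xd;
  PetscScalar       *f;
  PetscInt           xstart,xend,i;
  PetscErrorCode     ierr;

  PetscFunctionBeginUser;
  /* Every rank gets a full copy of X, since each residual entry may couple
     to all unknowns in this small, densely coupled problem. */
  ierr = VecScatterBegin(user->tolocal,X,user->Xloc,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(user->tolocal,X,user->Xloc,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecGetArrayRead(user->Xloc,&x);CHKERRQ(ierr);
  ierr = VecGetArrayRead(Xdot,&xd);CHKERRQ(ierr);

  /* Each rank computes only the rows of F that it owns. */
  ierr = VecGetOwnershipRange(F,&xstart,&xend);CHKERRQ(ierr);
  ierr = VecGetArray(F,&f);CHKERRQ(ierr);
  for (i = xstart; i < xend; i++) {
    /* f and xd are indexed locally: entry i-xstart is global row i;
       x holds the full state. Replace the line below with your equations. */
    f[i-xstart] = xd[i-xstart] - x[i];   /* placeholder */
  }
  ierr = VecRestoreArray(F,&f);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(Xdot,&xd);CHKERRQ(ierr);
  ierr = VecRestoreArrayRead(user->Xloc,&x);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

If each f[i] really depends on all of X, the full local copy is the simplest thing and is cheap for n = 288; if the coupling is sparser, you could scatter only the entries each rank actually needs.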
  Barry

On Aug 30, 2013, at 5:28 PM, "Jin, Shuangshuang" <[email protected]> wrote:

> Hello, I'm trying to update some of my status here. I just managed to
> "_distribute_ the work of computing the Jacobian matrix" as you suggested,
> so each processor only computes a part of the elements of the Jacobian
> matrix instead of the whole global Jacobian matrix. I observed a reduction
> of the computation time from 351 seconds to 55 seconds, which is much
> better but still slower than I expected, given that the problem size is
> small (4n functions in IFunction and a 4n*4n Jacobian matrix in IJacobian,
> with n = 288).
>
> I looked at the log profile again, and saw that most of the computation
> time is still in the Function Eval and Jacobian Eval:
>
> TSStep 600 1.0 5.6103e+01 1.0 9.42e+0825.6 3.0e+06 2.9e+02 7.0e+04 93100 99 99 92 152100 99 99110 279
> TSFunctionEval 2996 1.0 2.9608e+01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+04 30 0 0 0 39 50 0 0 0 47 0
> TSJacobianEval 1796 1.0 2.3436e+01 1.0 0.00e+00 0.0 5.4e+02 3.8e+01 1.3e+04 39 0 0 0 16 64 0 0 0 20 0
> Warning -- total time of even greater than time of entire stage -- something is wrong with the timer
> SNESSolve 600 1.0 5.5692e+01 1.1 9.42e+0825.7 3.0e+06 2.9e+02 6.4e+04 88100 99 99 84 144100 99 99101 281
> SNESFunctionEval 2396 1.0 2.3715e+01 3.4 1.04e+06 1.0 0.0e+00 0.0e+00 2.4e+04 25 0 0 0 31 41 0 0 0 38 1
> SNESJacobianEval 1796 1.0 2.3447e+01 1.0 0.00e+00 0.0 5.4e+02 3.8e+01 1.3e+04 39 0 0 0 16 64 0 0 0 20 0
> SNESLineSearch 1796 1.0 1.8313e+01 1.0 1.54e+0831.4 4.9e+05 2.9e+02 2.5e+04 30 16 16 16 33 50 16 16 16 39 139
> KSPGMRESOrthog 9090 1.0 1.1399e+00 4.1 1.60e+07 1.0 0.0e+00 0.0e+00 9.1e+03 1 3 0 0 12 2 3 0 0 14 450
> KSPSetUp 3592 1.0 2.8342e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+01 0 0 0 0 0 0 0 0 0 0 0
> KSPSolve 1796 1.0 2.3052e+00 1.0 7.87e+0825.2 2.5e+06 2.9e+02 2.0e+04 4 84 83 83 26 6 84 83 83 31 5680
> PCSetUp 3592 1.0 9.1255e-02 1.7 6.47e+05 2.5 0.0e+00 0.0e+00 1.8e+01 0 0 0 0 0 0 0 0 0 0 159
> PCSetUpOnBlocks 1796 1.0 6.6802e-02 2.3 6.47e+05 2.5 0.0e+00 0.0e+00 1.2e+01 0 0 0 0 0 0 0 0 0 0 217
> PCApply 10886 1.0 2.6064e-01 1.3 4.70e+06 1.5 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 1 1 0 0 0 481
>
> I was wondering why SNESFunctionEval and SNESJacobianEval took over 23
> seconds each, while KSPSolve took only 2.3 seconds, which is 10 times
> faster. Is this normal? Do you have any more suggestions on how to reduce
> the FunctionEval and JacobianEval time?
> (Currently the f function in IFunction is formulated sequentially, while
> the Jacobian matrix in IJacobian is formulated in a distributed way.)
>
> Thanks,
> Shuangshuang
>
>
> -----Original Message-----
> From: Jed Brown [mailto:[email protected]] On Behalf Of Jed Brown
> Sent: Friday, August 16, 2013 5:00 PM
> To: Jin, Shuangshuang; Barry Smith; Shri ([email protected])
> Cc: [email protected]
> Subject: RE: [petsc-users] Performance of PETSc TS solver
>
> "Jin, Shuangshuang" <[email protected]> writes:
>
>> ////////////////////////////////////////////////////////////////////////////////////////
>> // This proves to be the most time-consuming block in the computation:
>> // Assign values to J matrix for the first 2*n rows (constant values)
>> ... (skipped)
>>
>> // Assign values to J matrix for the following 2*n rows (depends on X values)
>> for (i = 0; i < n; i++) {
>>   for (j = 0; j < n; j++) {
>>     ...(skipped)
>
> This is a dense iteration. Are the entries really mostly nonzero? Why is
> your i loop over all rows instead of only over xstart to xstart+xlen?
>
>>   }
>>
>> ////////////////////////////////////////////////////////////////////////////////////////
>>
>> for (i = 0; i < 4*n; i++) {
>>   rowcol[i] = i;
>> }
>>
>> // Compute function over the locally owned part of the grid
>> for (i = xstart; i < xstart+xlen; i++) {
>>   ierr = MatSetValues(*B, 1, &i, 4*n, rowcol, &J[i][0], INSERT_VALUES); CHKERRQ(ierr);
>
> This seems to be creating a distributed dense matrix from a dense matrix J
> of the global dimension. Is that correct? You need to _distribute_ the work
> of computing the matrix entries if you want to see a speedup.
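
Regarding Jed's point about the dense i loop above: a sketch of restricting the Jacobian assembly to the locally owned rows is below. Again this is only a sketch; the helper name, the placeholder entries, and the full local copy of x passed in are assumptions, and N stands for the global size 4*n.

#include <petscmat.h>

/* Hypothetical helper, called from IJacobian with the preconditioner matrix B
   and a full local copy of X (same scatter as in the IFunction sketch above). */
PetscErrorCode FormJacobianLocalRows(Mat B,const PetscScalar *x,PetscInt N)
{
  PetscInt       rstart,rend,i,j,*cols;
  PetscScalar    *vals;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = MatGetOwnershipRange(B,&rstart,&rend);CHKERRQ(ierr);
  ierr = PetscMalloc(N*sizeof(PetscInt),&cols);CHKERRQ(ierr);
  ierr = PetscMalloc(N*sizeof(PetscScalar),&vals);CHKERRQ(ierr);
  for (j = 0; j < N; j++) cols[j] = j;

  /* Loop over the locally owned rows only, not over all 0..N-1 on every rank. */
  for (i = rstart; i < rend; i++) {
    for (j = 0; j < N; j++) vals[j] = (i == j) ? x[i] : 0.0;  /* placeholder for dF_i/dx_j */
    ierr = MatSetValues(B,1,&i,N,cols,vals,INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = PetscFree(cols);CHKERRQ(ierr);
  ierr = PetscFree(vals);CHKERRQ(ierr);

  ierr = MatAssemblyBegin(B,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

And if, as Jed asks, most of the entries are actually zero, then inserting only the nonzero columns of each row (with the matrix preallocated accordingly) rather than full dense rows should help both the assembly time and the solver.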
