Hello, I'm trying to update some of my status here. I just managed to 
_distribute_ the work of computing the Jacobian matrix as you suggested, so 
each processor now computes only its own portion of the Jacobian entries 
instead of the whole global Jacobian matrix. This reduced the computation 
time from 351 seconds to 55 seconds, which is much better but still slower 
than I expected given that the problem is small (4n equations in IFunction, 
and a 4n*4n Jacobian matrix in IJacobian, with n = 288).
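
For reference, the distributed assembly follows the usual PETSc 
ownership-range pattern. A minimal sketch of that pattern is below; 
computeJacobianRow() is a hypothetical stand-in for my actual entry 
formulas, not the real code:

  PetscInt    rstart, rend, i, j, N = 4*n;
  PetscInt    *rowcol;
  PetscScalar *vals;

  /* Query which rows this process owns, and compute only those */
  ierr = MatGetOwnershipRange(B, &rstart, &rend); CHKERRQ(ierr);
  ierr = PetscMalloc(N*sizeof(PetscInt), &rowcol); CHKERRQ(ierr);
  ierr = PetscMalloc(N*sizeof(PetscScalar), &vals); CHKERRQ(ierr);
  for (j = 0; j < N; j++) rowcol[j] = j;

  for (i = rstart; i < rend; i++) {
    computeJacobianRow(i, x, vals);  /* hypothetical: fill one owned row */
    ierr = MatSetValues(B, 1, &i, N, rowcol, vals, INSERT_VALUES); CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = PetscFree(rowcol); CHKERRQ(ierr);
  ierr = PetscFree(vals); CHKERRQ(ierr);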

I looked at the log profile again and saw that most of the computation time 
still goes to the function evaluation and the Jacobian evaluation:

TSStep               600 1.0 5.6103e+01 1.0 9.42e+0825.6 3.0e+06 2.9e+02 7.0e+04 93100 99 99 92 152100 99 99110   279
TSFunctionEval      2996 1.0 2.9608e+01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+04 30  0  0  0 39  50  0  0  0 47     0
TSJacobianEval      1796 1.0 2.3436e+01 1.0 0.00e+00 0.0 5.4e+02 3.8e+01 1.3e+04 39  0  0  0 16  64  0  0  0 20     0
Warning -- total time of event greater than time of entire stage -- something is wrong with the timer
SNESSolve            600 1.0 5.5692e+01 1.1 9.42e+0825.7 3.0e+06 2.9e+02 6.4e+04 88100 99 99 84 144100 99 99101   281
SNESFunctionEval    2396 1.0 2.3715e+01 3.4 1.04e+06 1.0 0.0e+00 0.0e+00 2.4e+04 25  0  0  0 31  41  0  0  0 38     1
SNESJacobianEval    1796 1.0 2.3447e+01 1.0 0.00e+00 0.0 5.4e+02 3.8e+01 1.3e+04 39  0  0  0 16  64  0  0  0 20     0
SNESLineSearch      1796 1.0 1.8313e+01 1.0 1.54e+0831.4 4.9e+05 2.9e+02 2.5e+04 30 16 16 16 33  50 16 16 16 39   139
KSPGMRESOrthog      9090 1.0 1.1399e+00 4.1 1.60e+07 1.0 0.0e+00 0.0e+00 9.1e+03  1  3  0  0 12   2  3  0  0 14   450
KSPSetUp            3592 1.0 2.8342e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 3.0e+01  0  0  0  0  0   0  0  0  0  0     0
KSPSolve            1796 1.0 2.3052e+00 1.0 7.87e+0825.2 2.5e+06 2.9e+02 2.0e+04  4 84 83 83 26   6 84 83 83 31  5680
PCSetUp             3592 1.0 9.1255e-02 1.7 6.47e+05 2.5 0.0e+00 0.0e+00 1.8e+01  0  0  0  0  0   0  0  0  0  0   159
PCSetUpOnBlocks     1796 1.0 6.6802e-02 2.3 6.47e+05 2.5 0.0e+00 0.0e+00 1.2e+01  0  0  0  0  0   0  0  0  0  0   217
PCApply            10886 1.0 2.6064e-01 1.3 4.70e+06 1.5 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   1  1  0  0  0   481

I was wondering why SNESFunctionEval and SNESJacobianEval each took over 23 
seconds, while KSPSolve took only 2.3 seconds, ten times less. Is this 
normal? Do you have any further suggestions on how to reduce the 
FunctionEval and JacobianEval times?
(Currently, my f function in IFunction is formulated sequentially; the 
Jacobian matrix in IJacobian is formulated in a distributed fashion, as 
described above.)
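
For comparison, I assume a distributed IFunction would mirror the Jacobian 
case, restricting the loop to the locally owned entries. A rough sketch 
follows; evaluateF() is a hypothetical stand-in for my actual equations, 
and any nonlocal entries of X it needs would first have to be gathered, 
e.g. with a VecScatter:

  PetscErrorCode MyIFunction(TS ts, PetscReal t, Vec X, Vec Xdot, Vec F, void *ctx)
  {
    PetscErrorCode    ierr;
    PetscInt          xstart, xend, i;
    const PetscScalar *x, *xdot;
    PetscScalar       *f;

    PetscFunctionBegin;
    /* Each process evaluates only its own contiguous block of F */
    ierr = VecGetOwnershipRange(X, &xstart, &xend); CHKERRQ(ierr);
    ierr = VecGetArrayRead(X, &x); CHKERRQ(ierr);
    ierr = VecGetArrayRead(Xdot, &xdot); CHKERRQ(ierr);
    ierr = VecGetArray(F, &f); CHKERRQ(ierr);
    for (i = xstart; i < xend; i++) {
      /* the raw arrays are indexed by the local offset i - xstart */
      f[i-xstart] = evaluateF(i, x[i-xstart], xdot[i-xstart]);
    }
    ierr = VecRestoreArrayRead(X, &x); CHKERRQ(ierr);
    ierr = VecRestoreArrayRead(Xdot, &xdot); CHKERRQ(ierr);
    ierr = VecRestoreArray(F, &f); CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }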

Thanks,
Shuangshuang

-----Original Message-----
From: Jed Brown [mailto:[email protected]] On Behalf Of Jed Brown
Sent: Friday, August 16, 2013 5:00 PM
To: Jin, Shuangshuang; Barry Smith; Shri ([email protected])
Cc: [email protected]
Subject: RE: [petsc-users] Performance of PETSc TS solver

"Jin, Shuangshuang" <[email protected]> writes:

>   
> ////////////////////////////////////////////////////////////////////////////////////////
>   // This proves to be the most time-consuming block in the computation:
>   // Assign values to J matrix for the first 2*n rows (constant values)
>   ... (skipped)
>
>   // Assign values to J matrix for the following 2*n rows (depends on X values)
>   for (i = 0; i < n; i++) {
>     for (j = 0; j < n; j++) {
>        ...(skipped)

This is a dense iteration.  Are the entries really mostly nonzero?  Why is your 
i loop over all rows instead of only over xstart to xstart+xlen?

>   }
>   
> ////////////////////////////////////////////////////////////////////////////////////////
>
>   for (i = 0; i < 4*n; i++) {
>     rowcol[i] = i;
>   }
>
>   // Compute function over the locally owned part of the grid
>   for (i = xstart; i < xstart+xlen; i++) {
>     ierr = MatSetValues(*B, 1, &i, 4*n, rowcol, &J[i][0], INSERT_VALUES); CHKERRQ(ierr);

This seems to be creating a distributed dense matrix from a dense matrix J 
of the global dimension.  Is that correct?  You need to _distribute_ the work 
of computing the matrix entries if you want to see a speedup.
