I would next parallelize the function evaluation, since it is the single 
largest consumer of time and should presumably be faster in parallel. After 
that, revisit the -log_summary output again to decide whether the Jacobian 
evaluation can be improved further.
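
   To illustrate the idea, here is a minimal sketch (not your actual model) of 
an IFunction that computes only the locally owned residual entries, using 
VecGetOwnershipRange so each process loops only over its own rows. The 
function name and the residual formula are placeholders:

  #include <petscts.h>

  /* Sketch only: each process computes just the residual rows it owns.   */
  /* The formula f[il] = xdot[il] - x[il] is a stand-in for the real f_i. */
  PetscErrorCode MyIFunction(TS ts, PetscReal t, Vec X, Vec Xdot, Vec F, void *ctx)
  {
    PetscErrorCode     ierr;
    PetscInt           xstart, xend, i;
    const PetscScalar *x, *xdot;
    PetscScalar       *f;

    PetscFunctionBeginUser;
    ierr = VecGetOwnershipRange(X, &xstart, &xend); CHKERRQ(ierr);
    ierr = VecGetArrayRead(X, &x); CHKERRQ(ierr);
    ierr = VecGetArrayRead(Xdot, &xdot); CHKERRQ(ierr);
    ierr = VecGetArray(F, &f); CHKERRQ(ierr);
    for (i = xstart; i < xend; i++) {   /* global row numbers owned locally */
      PetscInt il = i - xstart;         /* local index into the arrays      */
      f[il] = xdot[il] - x[il];         /* placeholder residual entry       */
    }
    ierr = VecRestoreArrayRead(X, &x); CHKERRQ(ierr);
    ierr = VecRestoreArrayRead(Xdot, &xdot); CHKERRQ(ierr);
    ierr = VecRestoreArray(F, &f); CHKERRQ(ierr);
    PetscFunctionReturn(0);
  }

   If an individual residual entry needs values of X owned by other processes, 
gather those first (for example with a VecScatter into a local work vector) 
before the loop.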

   Barry

On Aug 30, 2013, at 5:28 PM, "Jin, Shuangshuang" <[email protected]> 
wrote:

> Hello, I'd like to give a status update. I just managed to _distribute_ the 
> work of computing the Jacobian matrix as you suggested, so each processor 
> now computes only its own subset of the Jacobian entries instead of the 
> whole global Jacobian matrix. The computation time dropped from 351 seconds 
> to 55 seconds, which is much better, but still slower than I expected given 
> how small the problem is (4n functions in IFunction and a 4n*4n Jacobian 
> matrix in IJacobian, with n = 288).
> 
> I looked at the log profile again and saw that most of the computation time 
> still goes to the function evaluation and the Jacobian evaluation:
> 
> TSStep               600 1.0 5.6103e+01 1.0 9.42e+0825.6 3.0e+06 2.9e+02 
> 7.0e+04 93100 99 99 92 152100 99 99110   279
> TSFunctionEval      2996 1.0 2.9608e+01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 3.0e+04 30  0  0  0 39  50  0  0  0 47     0
> TSJacobianEval      1796 1.0 2.3436e+01 1.0 0.00e+00 0.0 5.4e+02 3.8e+01 
> 1.3e+04 39  0  0  0 16  64  0  0  0 20     0
> Warning -- total time of even greater than time of entire stage -- something 
> is wrong with the timer
> SNESSolve            600 1.0 5.5692e+01 1.1 9.42e+0825.7 3.0e+06 2.9e+02 
> 6.4e+04 88100 99 99 84 144100 99 99101   281
> SNESFunctionEval    2396 1.0 2.3715e+01 3.4 1.04e+06 1.0 0.0e+00 0.0e+00 
> 2.4e+04 25  0  0  0 31  41  0  0  0 38     1
> SNESJacobianEval    1796 1.0 2.3447e+01 1.0 0.00e+00 0.0 5.4e+02 3.8e+01 
> 1.3e+04 39  0  0  0 16  64  0  0  0 20     0
> SNESLineSearch      1796 1.0 1.8313e+01 1.0 1.54e+0831.4 4.9e+05 2.9e+02 
> 2.5e+04 30 16 16 16 33  50 16 16 16 39   139
> KSPGMRESOrthog      9090 1.0 1.1399e+00 4.1 1.60e+07 1.0 0.0e+00 0.0e+00 
> 9.1e+03  1  3  0  0 12   2  3  0  0 14   450
> KSPSetUp            3592 1.0 2.8342e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 3.0e+01  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve            1796 1.0 2.3052e+00 1.0 7.87e+0825.2 2.5e+06 2.9e+02 
> 2.0e+04  4 84 83 83 26   6 84 83 83 31  5680
> PCSetUp             3592 1.0 9.1255e-02 1.7 6.47e+05 2.5 0.0e+00 0.0e+00 
> 1.8e+01  0  0  0  0  0   0  0  0  0  0   159
> PCSetUpOnBlocks     1796 1.0 6.6802e-02 2.3 6.47e+05 2.5 0.0e+00 0.0e+00 
> 1.2e+01  0  0  0  0  0   0  0  0  0  0   217
> PCApply            10886 1.0 2.6064e-01 1.3 4.70e+06 1.5 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   1  1  0  0  0   481
> 
> I was wondering why SNESFunctionEval and SNESJacobianEval each took over 23 
> seconds, while KSPSolve took only 2.3 seconds, about 10 times less. Is this 
> normal? Do you have any more suggestions on how to reduce the function 
> evaluation and Jacobian evaluation time?
> (Currently the f function in my IFunction is formulated sequentially, while 
> the Jacobian matrix in IJacobian is formulated in a distributed way.)
> 
> Thanks,
> Shuangshuang
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Jed Brown [mailto:[email protected]] On Behalf Of Jed Brown
> Sent: Friday, August 16, 2013 5:00 PM
> To: Jin, Shuangshuang; Barry Smith; Shri ([email protected])
> Cc: [email protected]
> Subject: RE: [petsc-users] Performance of PETSc TS solver
> 
> "Jin, Shuangshuang" <[email protected]> writes:
> 
>>  
>> ////////////////////////////////////////////////////////////////////////////////////////
>>  // This proves to be the most time-consuming block in the computation:
>>  // Assign values to J matrix for the first 2*n rows (constant values)
>>  ... (skipped)
>> 
>>  // Assign values to J matrix for the following 2*n rows (depends on X values)
>>  for (i = 0; i < n; i++) {
>>    for (j = 0; j < n; j++) {
>>       ...(skipped)
> 
> This is a dense iteration.  Are the entries really mostly nonzero?  Why is 
> your i loop over all rows instead of only over xstart to xstart+xlen?
> 
>>  }
>> 
>> ////////////////////////////////////////////////////////////////////////////////////////
>> 
>>  for (i = 0; i < 4*n; i++) {
>>    rowcol[i] = i;
>>  }
>> 
>>  // Compute function over the locally owned part of the grid
>>  for (i = xstart; i < xstart+xlen; i++) {
>>    ierr = MatSetValues(*B, 1, &i, 4*n, rowcol, &J[i][0], INSERT_VALUES); CHKERRQ(ierr);
> 
> This seems to be creating a distributed dense matrix from a dense matrix J 
> of the global dimension.  Is that correct?  You need to _distribute_ the work 
> of computing the matrix entries if you want to see a speedup.
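
For reference, a minimal sketch of the distributed Jacobian fill Jed describes 
above, assuming this fragment sits inside the existing IJacobian (with ierr, n, 
and B already in scope, and adapting variable names as needed) and that B's row 
layout matches the vector layout. The entry formula is a placeholder:

  PetscInt     rstart, rend, row, col, ncol = 4*n;
  PetscInt    *cols;
  PetscScalar *vals;

  ierr = MatGetOwnershipRange(*B, &rstart, &rend); CHKERRQ(ierr);
  ierr = PetscMalloc(ncol*sizeof(PetscInt), &cols); CHKERRQ(ierr);
  ierr = PetscMalloc(ncol*sizeof(PetscScalar), &vals); CHKERRQ(ierr);
  for (col = 0; col < ncol; col++) cols[col] = col;

  /* Each process computes and inserts only the Jacobian rows it owns. */
  for (row = rstart; row < rend; row++) {
    for (col = 0; col < ncol; col++) {
      vals[col] = (row == col) ? 1.0 : 0.0; /* placeholder; use the real shift*dF/dXdot + dF/dX entry */
    }
    ierr = MatSetValues(*B, 1, &row, ncol, cols, vals, INSERT_VALUES); CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = PetscFree(cols); CHKERRQ(ierr);
  ierr = PetscFree(vals); CHKERRQ(ierr);

As Jed points out, if most of those 4*n entries per row are actually zero, it 
is much better to insert only the structurally nonzero columns (and to 
preallocate the matrix accordingly) rather than a full dense row.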
