# Re: RunPF vs GPU calculations

```Hello all,

I've been meaning to report the following findings to Ray and the MATPOWER
team for some time, but never got around doing it.  Well, here it is in
summarized form:```
```
- There's a very cheap way to speedup the J \ F step (actually, the
sparse LU decomposition) by factors of about x1.3 to x1.8, and maybe higher
for bigger cases.  It just involves invoking the MATLAB solver that is best
suited for power system matrices (more below).
- GPU computing, at least in its current state, is not a panacea for the
powerflow problem, because the kind of sparse linear algebra that's needed
in powerflow problems just doesn't offer enough opportunities to exploit
GPU architectures.

Try this simple change in *newtonpf.m*:

marinjl@xwing:~/Library/matpower6.0b2\$ diff newtonpf.m.ORIG newtonpf.m
89c89,93
<     dx = -(J \ F);
---
>     %dx = -(J \ F);
>     % JLM: For performance, factorize using MATLAB's Gilbert-Peierls +
AMD (see T. Davis)
>     Q = sparse(amd(J), 1:size(J,1), 1);  % AMD reordering
>     [L, U, P] = lu(Q'*J*Q, 1.0);  % this form ensures that G-P is called.
Last arg: 0==no pivoting, 1.0==partial pivoting.
>     dx = - Q * ( U \ ( L \ (P * Q' * F)));

My benchmarks (3GHz Intel Core 2 Duo) yield the following speedups:

%                    UMFPACK        G-P        speedup
% ======================================================
% case300:          1.689e-02    1.230e-02      1.37
% case1354pegase:   5.848e-02    3.652e-02      1.60
% case1888rte:      2.792e-02    2.130e-02      1.31
% case1951rte:      3.037e-02    2.169e-02      1.40
% case2383wp:       1.429e-01    8.711e-02      1.64
% case2736sp:       1.327e-01    7.512e-02      1.76
% case2737sop:      1.632e-01    9.074e-02      1.79
% case2746wop:      1.589e-01    8.725e-02      1.82
% case2746wp:       1.448e-01    8.636e-02      1.67
% case2848rte:      4.652e-02    3.301e-02      1.40
% case2868rte:      1.681e-01    9.661e-02      1.74
% case2869pegase:   2.402e-01    1.335e-01      1.79
% case3012wp:       1.125e-01    6.733e-02      1.67
% case3120sp:       2.155e-01    1.228e-01      1.75
% case3375wp:       1.598e-01    9.565e-02      1.67
% case6468rte:      2.155e-01    1.443e-01      1.49
% case6470rte:      2.316e-01    1.434e-01      1.61
% case6495rte:      2.118e-01    1.433e-01      1.47
% case6515rte:      2.244e-01    1.443e-01      1.55
% case9241pegase:   1.185e+00    6.392e-01      1.85
% case13659pegase:  1.484e+00    7.991e-01      1.85
%
%
% max speedup: 1.85
% min speedup: 1.31

(note: I used pf.tol=1.0e-10 in order to trigger a few more iterations. I
ran each case 10 times, and kept the best timing)

*Longer explanation:*
The Jacobian matrices that one typically obtains in power networks are *very,
very* sparse.  They are very different than sparse matrices obtained in
other areas (such as, e.g., finite-element modeling), where the matrix
still has fairly dense blocks here and there (or at least it can be
re-arranged to have them).  Power system matrices, as well as many
electronic circuit matrices, have dense blocks of sizes 3x3 or 4x4 at most,
which is rather small.  Now, the vast majority of research going into
sparse LU solvers focuses on the former kind of matrices, i.e. those with
large dense blocks.  The methods go by names like "supernodal" and
"multifrontal", and basically consist in designing the algorithm so that it
can exploit the BLAS for processing those dense blocks.  The point is that
power system matrices are ill-suited to exploit the power of the BLAS, and
therefore those methods are actually slower than the more straight-forward
sparse solvers that don't try to use the BLAS at all.

MATLAB contains sparse solvers of both types, but the problem is that it's
not clearly documented how the backslash operator (or the lu call) uses
internally one algorithm or the other.  It turns out that the way you call
lu() influences the algorithm being used!!!  The default is UMFPACK, but if
you use the "three-output, two-input args" call as above then you'll be
invoking a Gilbert-Peierls method.  As you may have guessed by now, UMFPACK
is one of those methods using the BLAS, while G-P is not.  As shown in the
benchmarks above, UMFPACK clearly has an overhead.

Coming back to the issue of GPUs, one can see why the potential performance
gains for this type of sparse linear algebra are not as great as they seem.

If you are interested in more details, I can provide references.  See for
instance SuiteSparse by Tim Davies, which in terms of sparse solvers is
probably the best in class now  (some of it is already integrated within
MATLAB, btw).

--
Jose L. Marin
Grupo AIA

2016-11-30 11:02 GMT+01:00 davor sutic <davor.su...@gmail.com>:

> Hello Jaroslaw,
> I find your analysis very interesting. Would you mind sharing your
> performance findings about a broader range of network sizes (e.g. 118, 300,
> perhaps even those of a few thousand)?
>
> Also, are the times you stated the mean time per iteration of the total
> running times of the 1000 iterations?
>
> Thanks a lot
> Davor
>
> On Nov 30, 2016 6:26 AM, "Jaroslaw Krata" <j.kr...@uq.edu.au> wrote:
>
>> Hello Matpower Team,
>>
>>
>> I have a quick question related to runpf calculation time.
>>
>> Some time ago I started to use matpower to calculate ac power flow.
>>
>> Due to the fact that I need thousands of power flows I wish to run it
>> effectively.
>>
>>
>> I run Matlab Profiler and I have found out that there is a lot of
>> computation time taken by simple algebra.
>>
>> I.e. (calculation time for 1000 iterations of runpf of 49 bus system):
>>
>> Function     Calls      Total time    Self time
>> runpf           1000     1.692            0.183 s
>>
>> newtonpf   1000     0.684             0.552 s
>>
>>
>> If you have a closer look into newtonpf, the 50% of time is taken by "dx
>> = -(J \ F);" operation!
>>
>> Similarly, dSbus_dV consumes >60% of time by only two lines:
>> "dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm;
>> dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);"
>>
>> These 3 lines take in total 0.420 out of 1.692 second.
>>
>>
>> Have you considered to move this algebra to CUDA GPU?
>>
>> Are there any known constraints not to use CUDA?
>>
>>
>> https://developer.nvidia.com/matlab-cuda
>>