Hi Jose,
thanks for sharing. I have one quick question: what are the units in your
benchmark (seconds, ticks, etc.)?

Thanks a lot

On Thu, Dec 1, 2016 at 6:52 PM, Jose Luis Marín <mari...@aia.es> wrote:

> Hello all,
>
> I've been meaning to report the following findings to Ray and the MATPOWER
> team for some time, but never got around to doing it.  Well, here it is in
> summarized form:
>
>    - There's a very cheap way to speed up the J \ F step (actually, the
>    sparse LU decomposition) by factors of about x1.3 to x1.8, and maybe
>    higher for bigger cases.  It just involves invoking the MATLAB solver
>    that is best suited for power system matrices (more below).
>    - GPU computing, at least in its current state, is not a panacea for
>    the power flow problem, because the kind of sparse linear algebra
>    needed in power flow problems just doesn't offer enough opportunities
>    to exploit GPU architectures.
>
>
> Try this simple change in *newtonpf.m*:
>
> marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m
> 89c89,93
> <     dx = -(J \ F);
> ---
> >     %dx = -(J \ F);
> >     % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + AMD (see T. Davis)
> >     Q = sparse(amd(J), 1:size(J,1), 1);  % AMD reordering
> >     [L, U, P] = lu(Q'*J*Q, 1.0);  % this form ensures that G-P is called. Last arg: 0==no pivoting, 1.0==partial pivoting.
> >     dx = - Q * ( U \ ( L \ (P * Q' * F)));
>
>
> My benchmarks (3GHz Intel Core 2 Duo) yield the following speedups:
>
> %                    UMFPACK        G-P        speedup
> % ======================================================
> % case300:          1.689e-02    1.230e-02      1.37
> % case1354pegase:   5.848e-02    3.652e-02      1.60
> % case1888rte:      2.792e-02    2.130e-02      1.31
> % case1951rte:      3.037e-02    2.169e-02      1.40
> % case2383wp:       1.429e-01    8.711e-02      1.64
> % case2736sp:       1.327e-01    7.512e-02      1.76
> % case2737sop:      1.632e-01    9.074e-02      1.79
> % case2746wop:      1.589e-01    8.725e-02      1.82
> % case2746wp:       1.448e-01    8.636e-02      1.67
> % case2848rte:      4.652e-02    3.301e-02      1.40
> % case2868rte:      1.681e-01    9.661e-02      1.74
> % case2869pegase:   2.402e-01    1.335e-01      1.79
> % case3012wp:       1.125e-01    6.733e-02      1.67
> % case3120sp:       2.155e-01    1.228e-01      1.75
> % case3375wp:       1.598e-01    9.565e-02      1.67
> % case6468rte:      2.155e-01    1.443e-01      1.49
> % case6470rte:      2.316e-01    1.434e-01      1.61
> % case6495rte:      2.118e-01    1.433e-01      1.47
> % case6515rte:      2.244e-01    1.443e-01      1.55
> % case9241pegase:   1.185e+00    6.392e-01      1.85
> % case13659pegase:  1.484e+00    7.991e-01      1.85
> %
> %
> % max speedup: 1.85
> % min speedup: 1.31
>
> (note: I used pf.tol=1.0e-10 in order to trigger a few more iterations. I
> ran each case 10 times, and kept the best timing)
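>
> For reference, that methodology can be scripted roughly as follows (a
> minimal sketch, not the exact script used here; it assumes MATPOWER's
> runpf/mpoption are on the path, uses only a subset of the cases, and times
> the whole runpf call rather than just the J \ F step, so the absolute
> numbers will differ from the table):
>
> cases = {'case300', 'case1354pegase', 'case2869pegase'};  % illustrative subset
> mpopt = mpoption('pf.tol', 1e-10, 'verbose', 0, 'out.all', 0);
> for k = 1:length(cases)
>     best = Inf;
>     for rep = 1:10                      % run each case 10 times, keep the best
>         t0 = tic;
>         runpf(cases{k}, mpopt);
>         best = min(best, toc(t0));
>     end
>     fprintf('%-16s  %.3e\n', cases{k}, best);
> end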
>
>
> *Longer explanation:*
> The Jacobian matrices that one typically obtains in power networks are *very,
> very* sparse.  They are very different from the sparse matrices obtained in
> other areas (e.g. finite-element modeling), where the matrix still has fairly
> dense blocks here and there (or can at least be re-arranged to have them).
> Power system matrices, like many electronic circuit matrices, have dense
> blocks of size 3x3 or 4x4 at most, which is rather small.  Now, the vast
> majority of research on sparse LU solvers focuses on the former kind of
> matrix, i.e. those with large dense blocks.  The methods go by names like
> "supernodal" and "multifrontal", and basically consist of designing the
> algorithm so that it can exploit the BLAS when processing those dense
> blocks.  The point is that power system matrices are ill-suited to
> exploiting the power of the BLAS, and therefore those methods are actually
> slower than the more straightforward sparse solvers that don't use the
> BLAS at all.
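>
> (A quick way to appreciate just how sparse these matrices are, using
> standard MATPOWER functions; the case name is only an example:)
>
> mpc  = ext2int(loadcase('case2869pegase'));
> Ybus = makeYbus(mpc.baseMVA, mpc.bus, mpc.branch);
> fprintf('average nonzeros per row: %.1f\n', nnz(Ybus) / size(Ybus, 1));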
>
> MATLAB contains sparse solvers of both types, but it is not clearly
> documented when the backslash operator (or the lu call) uses one algorithm
> internally and when it uses the other.  It turns out that the way you call
> lu() determines the algorithm being used!  The default is UMFPACK, but if
> you use the "three-output, two-input args" call as above, then you'll be
> invoking a Gilbert-Peierls method.  As you may have guessed by now, UMFPACK
> is one of the methods that use the BLAS, while G-P is not.  As shown in the
> benchmarks above, UMFPACK clearly has an overhead.
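>
> To see the effect in isolation, here is a minimal sketch contrasting the
> two calling forms (the matrix is a random stand-in rather than a real
> Jacobian, and it uses index permutation instead of the explicit permutation
> matrix Q from the diff above; both forms should agree to machine precision):
>
> n = 2000;
> A = sprandn(n, n, 1e-3) + 10*speye(n);   % stand-in for a sparse Jacobian
> b = randn(n, 1);
>
> % Default: backslash (or 4-output lu) -> UMFPACK, multifrontal, BLAS-based
> x1 = A \ b;
>
> % 3-output, 2-input lu on a sparse matrix -> Gilbert-Peierls left-looking LU
> q = amd(A);                        % AMD fill-reducing ordering
> [L, U, P] = lu(A(q, q), 1.0);      % 1.0 = partial pivoting threshold
> x2 = zeros(n, 1);
> x2(q) = U \ (L \ (P * b(q)));      % permute, solve, un-permute
>
> disp(norm(x1 - x2) / norm(x1));    % should be around machine precision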
>
> Coming back to the issue of GPUs, one can see why the potential
> performance gains for this type of sparse linear algebra are not as great
> as they might seem: without large dense blocks there is little
> data-parallel work to hand off to the GPU.
>
> If you are interested in more details, I can provide references.  See for
> instance SuiteSparse by Tim Davis, which in terms of sparse solvers is
> probably the best in class right now (some of it is already integrated
> within MATLAB, btw).
>
> --
> Jose L. Marin
> Grupo AIA
>
>
>
>
> 2016-11-30 11:02 GMT+01:00 davor sutic <davor.su...@gmail.com>:
>
>> Hello Jaroslaw,
>> I find your analysis very interesting. Would you mind sharing your
>> performance findings for a broader range of network sizes (e.g. 118, 300,
>> perhaps even a few thousand buses)?
>>
>> Also, are the times you stated the mean time per iteration or the total
>> running time over the 1000 iterations?
>>
>> Thanks a lot
>> Davor
>>
>> On Nov 30, 2016 6:26 AM, "Jaroslaw Krata" <j.kr...@uq.edu.au> wrote:
>>
>>> Hello Matpower Team,
>>>
>>>
>>> I have a quick question related to runpf calculation time.
>>>
>>> Some time ago I started using MATPOWER to calculate AC power flows.
>>>
>>> Since I need to run thousands of power flows, I want to run them
>>> efficiently.
>>>
>>>
>>> I ran the MATLAB Profiler and found that a lot of the computation time is
>>> taken by simple algebra.
>>>
>>> E.g. (calculation time for 1000 iterations of runpf on a 49-bus system):
>>>
>>> Function     Calls    Total time    Self time
>>> runpf        1000     1.692 s       0.183 s
>>> newtonpf     1000     0.684 s       0.552 s
>>>
>>>
>>> If you take a closer look into newtonpf, 50% of the time is taken by the
>>> "dx = -(J \ F);" operation!
>>>
>>> Similarly, in dSbus_dV more than 60% of the time is consumed by just two
>>> lines:
>>> "dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm;
>>> dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);"
>>>
>>> These 3 lines take 0.420 out of 1.692 seconds in total.
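>>>
>>> (For reference, a minimal sketch of how such a profile can be gathered;
>>> 'case57' below is just a stand-in for the 49-bus system, and the mpoption
>>> flags only silence the printed output:)
>>>
>>> mpopt = mpoption('verbose', 0, 'out.all', 0);
>>> profile on
>>> for k = 1:1000
>>>     runpf('case57', mpopt);   % placeholder case; any MATPOWER case works
>>> end
>>> profile off
>>> profile viewer   % inspect Total/Self time of runpf, newtonpf, dSbus_dV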
>>>
>>>
>>> Have you considered moving this algebra to a CUDA GPU?
>>>
>>> Are there any known constraints that would prevent using CUDA?
>>>
>>>
>>> https://developer.nvidia.com/matlab-cuda
>>>
>>> https://www.youtube.com/watch?v=PwCAWxPcDkw
>>>
>>>
>>>
>>> BR,
>>>
>>> Jaroslaw
>>>
>>>
>>>
>>>
>
