Oops, sorry, they're *seconds* (for one runpf() run).
Here's the benchmarking script, it's pretty simple:
case_list = {'case300', 'case1354pegase', 'case1888rte', 'case1951rte', ...
'case2383wp', 'case2736sp', 'case2737sop', 'case2746wop', ...
'case2746wp', 'case2848rte', 'case2868rte', 'case2869pegase', ...
'case3012wp', 'case3120sp', 'case3375wp', ...
'case6468rte', 'case6470rte', 'case6495rte', 'case6515rte', ...
'case9241pegase', 'case13659pegase'};
opts = mpoption('pf.alg', 'NR', 'pf.tol', 1.0e-10, 'pf.nr.max_it', 100, ...
'out.all', 0, 'verbose', 0);
NREPEAT = 10;
timings = zeros(NREPEAT,1);
for k = 1:length(case_list)
    cfile = case_list{k};               % case file name
    fprintf('*** BENCHMARKING %s:\t', cfile);
    mpc = loadcase(cfile);
    for i = 1:NREPEAT
        tic;
        mpsol = runpf(mpc, opts);
        timings(i) = toc;
    end
    fprintf('%11.3e\n', min(timings));  % keep the best of NREPEAT runs
end
--
Jose L. Marin
Grupo AIA
2016-12-01 19:01 GMT+01:00 davor sutic <[email protected]>:
> Hi Jose,
> thanks for sharing. I have one quick question: What are the units of
> your benchmark (seconds, ticks, etc.)?
>
> Thanks a lot
>
> On Thu, Dec 1, 2016 at 6:52 PM, Jose Luis Marín <[email protected]> wrote:
>
>> Hello all,
>>
>> I've been meaning to report the following findings to Ray and the
>> MATPOWER team for some time, but never got around to doing it. Well, here
>> it is in summarized form:
>>
>> - There's a very cheap way to speed up the J \ F step (actually, the
>> sparse LU decomposition) by factors of about x1.3 to x1.8, and maybe
>> higher for bigger cases. It just involves invoking the MATLAB solver
>> that is best suited for power system matrices (more below).
>> - GPU computing, at least in its current state, is not a panacea for
>> the power flow problem, because the kind of sparse linear algebra
>> needed there just doesn't offer enough opportunities to exploit GPU
>> architectures.
>>
>>
>> Try this simple change in *newtonpf.m*:
>>
>> marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m
>> 89c89,93
>> < dx = -(J \ F);
>> ---
>> > %dx = -(J \ F);
>> > % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + AMD (see T. Davis)
>> > Q = sparse(amd(J), 1:size(J,1), 1);   % AMD reordering
>> > [L, U, P] = lu(Q'*J*Q, 1.0);   % this form ensures that G-P is called. Last arg: 0 == no pivoting, 1.0 == partial pivoting.
>> > dx = - Q * ( U \ ( L \ (P * Q' * F)));
>>
>>
>> My benchmarks (3GHz Intel Core 2 Duo) yield the following speedups:
>>
>> %                      UMFPACK (s)    G-P (s)    speedup
>> % ======================================================
>> % case300: 1.689e-02 1.230e-02 1.37
>> % case1354pegase: 5.848e-02 3.652e-02 1.60
>> % case1888rte: 2.792e-02 2.130e-02 1.31
>> % case1951rte: 3.037e-02 2.169e-02 1.40
>> % case2383wp: 1.429e-01 8.711e-02 1.64
>> % case2736sp: 1.327e-01 7.512e-02 1.76
>> % case2737sop: 1.632e-01 9.074e-02 1.79
>> % case2746wop: 1.589e-01 8.725e-02 1.82
>> % case2746wp: 1.448e-01 8.636e-02 1.67
>> % case2848rte: 4.652e-02 3.301e-02 1.40
>> % case2868rte: 1.681e-01 9.661e-02 1.74
>> % case2869pegase: 2.402e-01 1.335e-01 1.79
>> % case3012wp: 1.125e-01 6.733e-02 1.67
>> % case3120sp: 2.155e-01 1.228e-01 1.75
>> % case3375wp: 1.598e-01 9.565e-02 1.67
>> % case6468rte: 2.155e-01 1.443e-01 1.49
>> % case6470rte: 2.316e-01 1.434e-01 1.61
>> % case6495rte: 2.118e-01 1.433e-01 1.47
>> % case6515rte: 2.244e-01 1.443e-01 1.55
>> % case9241pegase: 1.185e+00 6.392e-01 1.85
>> % case13659pegase: 1.484e+00 7.991e-01 1.85
>> %
>> %
>> % max speedup: 1.85
>> % min speedup: 1.31
>>
>> (note: I used pf.tol=1.0e-10 in order to trigger a few more iterations. I
>> ran each case 10 times, and kept the best timing)
>>
>>
>> *Longer explanation:*
>> The Jacobian matrices that one typically obtains in power networks are
>> *very, very* sparse. They are very different from the sparse matrices
>> obtained in other areas (such as, e.g., finite-element modeling), where
>> the matrix still has fairly dense blocks here and there (or at least it
>> can be re-arranged to have them). Power system matrices, as well as many
>> electronic circuit matrices, have dense blocks of sizes 3x3 or 4x4 at
>> most, which is rather small. Now, the vast majority of research going
>> into sparse LU solvers focuses on the former kind of matrices, i.e. those
>> with large dense blocks. The methods go by names like "supernodal" and
>> "multifrontal", and basically consist of designing the algorithm so that
>> it can exploit the BLAS when processing those dense blocks. The point is
>> that power system matrices are ill-suited to exploit the power of the
>> BLAS, and therefore those methods are actually slower than the more
>> straightforward sparse solvers that don't try to use the BLAS at all.
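>>
>> As a rough illustration (a minimal sketch assuming a standard MATPOWER
>> installation is on the path; the exact figures vary by case), you can look
>> at the density of the bus admittance matrix, whose sparsity pattern is
>> what drives the Jacobian's:
>>
>> % Sketch: how sparse is a typical power system matrix?
>> mpc  = ext2int(loadcase('case9241pegase'));   % internal (consecutive) bus numbering
>> Ybus = makeYbus(mpc.baseMVA, mpc.bus, mpc.branch);
>> n    = size(Ybus, 1);
>> fprintf('n = %d, nnz = %d, density = %.4f%%, avg nonzeros per row = %.1f\n', ...
>>     n, nnz(Ybus), 100 * nnz(Ybus) / n^2, nnz(Ybus) / n);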
>>
>> MATLAB contains sparse solvers of both types, but it is not clearly
>> documented how the backslash operator (or the lu call) internally chooses
>> between one algorithm and the other. It turns out that the way you call
>> lu() determines the algorithm being used! The default is UMFPACK, but if
>> you use the "three-output, two-input args" call as above, then you'll be
>> invoking a Gilbert-Peierls method. As you may have guessed by now, UMFPACK
>> is one of those methods using the BLAS, while G-P is not. As shown in the
>> benchmarks above, UMFPACK clearly has an overhead on these matrices.
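>>
>> If you want to confirm which code path you are hitting on your own MATLAB
>> version, one way is the sparse monitor flag (spparms is documented, though
>> how much detail it prints varies across releases). A minimal sketch, with
>> J and F as in newtonpf:
>>
>> % Sketch: ask MATLAB to report which sparse factorization it uses.
>> spparms('spumoni', 2);     % verbose sparse-matrix diagnostics
>> x1 = J \ F;                % the default path (UMFPACK for this kind of matrix)
>> [L, U, P] = lu(J, 1.0);    % three outputs + pivot threshold: the G-P path
>> x2 = U \ (L \ (P * F));
>> spparms('spumoni', 0);     % diagnostics off again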
>>
>> Coming back to the issue of GPUs, one can see why the potential
>> performance gains for this type of sparse linear algebra are not as great
>> as they seem.
>>
>> If you are interested in more details, I can provide references. See for
>> instance SuiteSparse by Tim Davis, which in terms of sparse solvers is
>> probably the best in class now (some of it is already integrated within
>> MATLAB, btw).
>>
>> --
>> Jose L. Marin
>> Grupo AIA
>>
>>
>>
>>
>> 2016-11-30 11:02 GMT+01:00 davor sutic <[email protected]>:
>>
>>> Hello Jaroslaw,
>>> I find your analysis very interesting. Would you mind sharing your
>>> performance findings for a broader range of network sizes (e.g. 118, 300,
>>> perhaps even cases of a few thousand buses)?
>>>
>>> Also, are the times you stated the mean time per iteration or the total
>>> running time of the 1000 iterations?
>>>
>>> Thanks a lot
>>> Davor
>>>
>>> On Nov 30, 2016 6:26 AM, "Jaroslaw Krata" <[email protected]> wrote:
>>>
>>>> Hello MATPOWER Team,
>>>>
>>>>
>>>> I have a quick question related to runpf calculation time.
>>>>
>>>> Some time ago I started using MATPOWER to calculate AC power flows.
>>>>
>>>> Since I need thousands of power flows, I would like to run them
>>>> efficiently.
>>>>
>>>>
>>>> I ran the MATLAB Profiler and found that a lot of the computation time
>>>> is taken up by simple algebra.
>>>>
>>>> E.g. (calculation time for 1000 runs of runpf on a 49-bus system):
>>>>
>>>> Function     Calls    Total time    Self time
>>>> runpf        1000     1.692 s       0.183 s
>>>> newtonpf     1000     0.684 s       0.552 s
>>>>
>>>>
>>>> If you take a closer look at newtonpf, 50% of the time is taken by the
>>>> "dx = -(J \ F);" operation!
>>>>
>>>> Similarly, dSbus_dV spends >60% of its time in just two lines:
>>>> "dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm;
>>>> dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);"
>>>>
>>>> These three lines take 0.420 s out of 1.692 s in total.
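>>>>
>>>> For reference, a minimal sketch of how such a profiling run can be set
>>>> up ('mycase49' is only a placeholder for the 49-bus case file):
>>>>
>>>> % Sketch: profile repeated power flow runs.
>>>> mpc  = loadcase('mycase49');    % placeholder case name
>>>> opts = mpoption('out.all', 0, 'verbose', 0);
>>>> profile on;
>>>> for i = 1:1000
>>>>     runpf(mpc, opts);
>>>> end
>>>> profile viewer;    % inspect time spent in newtonpf, dSbus_dV, etc.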
>>>>
>>>>
>>>> Have you considered moving this algebra to a CUDA GPU?
>>>>
>>>> Are there any known constraints that prevent using CUDA?
>>>>
>>>>
>>>> https://developer.nvidia.com/matlab-cuda
>>>>
>>>> https://www.youtube.com/watch?v=PwCAWxPcDkw
>>>>
>>>>
>>>>
>>>> BR,
>>>>
>>>> Jaroslaw
>>>>
>>>>
>>>>
>>>>
>>
>