Oops, sorry, they're *seconds* (for one runpf() run).
Here's the benchmarking script, it's pretty simple:
case_list = {'case300', 'case1354pegase', 'case1888rte', 'case1951rte', ...
'case2383wp', 'case2736sp', 'case2737sop', 'case2746wop', ...
'case2746wp', 'case2848rte', 'case2868rte', 'case2869pegase', ...
'case3012wp', 'case3120sp', 'case3375wp', ...
'case6468rte', 'case6470rte', 'case6495rte', 'case6515rte', ...
'case9241pegase', 'case13659pegase'};
opts = mpoption('pf.alg', 'NR', 'pf.tol', 1.0e-10, 'pf.nr.max_it', 100, ...
'out.all', 0, 'verbose', 0);
NREPEAT = 10;
timings = zeros(NREPEAT,1);
for k = 1:length(case_list)
    cfile = case_list{k};               % case file name
    fprintf('*** BENCHMARKING %s:\t', cfile);
    mpc = loadcase(cfile);
    for i = 1:NREPEAT
        tic;
        mpsol = runpf(mpc, opts);
        timings(i) = toc;
    end
    fprintf('%11.3e\n', min(timings));  % keep the best of NREPEAT runs
end
--
Jose L. Marin
Grupo AIA
2016-12-01 19:01 GMT+01:00 davor sutic <[email protected]>:
> Hi Jose,
> thanks for sharing. I have one quick question: What are the units of
> your benchmark (seconds, ticks, etc.)?
>
> Thanks a lot
>
> On Thu, Dec 1, 2016 at 6:52 PM, Jose Luis Marín <[email protected]> wrote:
>
>> Hello all,
>>
>> I've been meaning to report the following findings to Ray and the
>> MATPOWER team for some time, but never got around to doing it. Well, here
>> it is in summarized form:
>>
>> - There's a very cheap way to speed up the J \ F step (actually, the
>> sparse LU decomposition) by factors of about x1.3 to x1.8, and maybe
>> higher for bigger cases. It just involves invoking the MATLAB solver
>> that is best suited for power system matrices (more below).
>> - GPU computing, at least in its current state, is not a panacea for
>> the power flow problem, because the kind of sparse linear algebra
>> needed there just doesn't offer enough opportunities to exploit GPU
>> architectures.
>>
>>
>> Try this simple change in *newtonpf.m*:
>>
>> marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m
>> 89c89,93
>> < dx = -(J \ F);
>> ---
>> > %dx = -(J \ F);
>> > % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + AMD (see T. Davis)
>> > Q = sparse(amd(J), 1:size(J,1), 1);   % AMD reordering
>> > [L, U, P] = lu(Q'*J*Q, 1.0);   % this form ensures that G-P is called. Last arg: 0 == no pivoting, 1.0 == partial pivoting.
>> > dx = - Q * ( U \ ( L \ (P * Q' * F)));
>>
>>
>> My benchmarks (3GHz Intel Core 2 Duo) yield the following speedups:
>>
>> %                      UMFPACK (s)    G-P (s)    speedup
>> % ======================================================
>> % case300: 1.689e-02 1.230e-02 1.37
>> % case1354pegase: 5.848e-02 3.652e-02 1.60
>> % case1888rte: 2.792e-02 2.130e-02 1.31
>> % case1951rte: 3.037e-02 2.169e-02 1.40
>> % case2383wp: 1.429e-01 8.711e-02 1.64
>> % case2736sp: 1.327e-01 7.512e-02 1.76
>> % case2737sop: 1.632e-01 9.074e-02 1.79
>> % case2746wop: 1.589e-01 8.725e-02 1.82
>> % case2746wp: 1.448e-01 8.636e-02 1.67
>> % case2848rte: 4.652e-02 3.301e-02 1.40
>> % case2868rte: 1.681e-01 9.661e-02 1.74
>> % case2869pegase: 2.402e-01 1.335e-01 1.79
>> % case3012wp: 1.125e-01 6.733e-02 1.67
>> % case3120sp: 2.155e-01 1.228e-01 1.75
>> % case3375wp: 1.598e-01 9.565e-02 1.67
>> % case6468rte: 2.155e-01 1.443e-01 1.49
>> % case6470rte: 2.316e-01 1.434e-01 1.61
>> % case6495rte: 2.118e-01 1.433e-01 1.47
>> % case6515rte: 2.244e-01 1.443e-01 1.55
>> % case9241pegase: 1.185e+00 6.392e-01 1.85
>> % case13659pegase: 1.484e+00 7.991e-01 1.85
>> %
>> %
>> % max speedup: 1.85
>> % min speedup: 1.31
>>
>> (note: I used pf.tol=1.0e-10 in order to trigger a few more iterations. I
>> ran each case 10 times, and kept the best timing)
>>
>>
>> *Longer explanation:*
>> The Jacobian matrices that one typically obtains in power networks are
>> *very, very* sparse. They are very different from the sparse matrices
>> obtained in other areas (such as, e.g., finite-element modeling), where
>> the matrix still has fairly dense blocks here and there (or at least it
>> can be re-arranged to have them). Power system matrices, as well as many
>> electronic circuit matrices, have dense blocks of sizes 3x3 or 4x4 at
>> most, which is rather small. Now, the vast majority of research going
>> into sparse LU solvers focuses on the former kind of matrices, i.e. those
>> with large dense blocks. The methods go by names like "supernodal" and
>> "multifrontal", and basically consist of designing the algorithm so that
>> it can exploit the BLAS when processing those dense blocks. The point is
>> that power system matrices are ill-suited to exploit the power of the
>> BLAS, and therefore those methods are actually slower than the more
>> straightforward sparse solvers that don't try to use the BLAS at all.
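>>
>> As a rough illustration (a minimal sketch assuming a standard MATPOWER
>> installation is on the path; the exact figures vary by case), you can look
>> at the density of the bus admittance matrix, whose sparsity pattern is
>> what drives the Jacobian's:
>>
>> % Sketch: how sparse is a typical power system matrix?
>> mpc  = ext2int(loadcase('case9241pegase'));   % internal (consecutive) bus numbering
>> Ybus = makeYbus(mpc.baseMVA, mpc.bus, mpc.branch);
>> n    = size(Ybus, 1);
>> fprintf('n = %d, nnz = %d, density = %.4f%%, avg nonzeros per row = %.1f\n', ...
>>     n, nnz(Ybus), 100 * nnz(Ybus) / n^2, nnz(Ybus) / n);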
>>
>> MATLAB contains sparse solvers of both types, but it is not clearly
>> documented how the backslash operator (or the lu call) internally chooses
>> between one algorithm and the other. It turns out that the way you call
>> lu() determines the algorithm being used! The default is UMFPACK, but if
>> you use the "three-output, two-input args" call as above, then you'll be
>> invoking a Gilbert-Peierls method. As you may have guessed by now, UMFPACK
>> is one of those methods using the BLAS, while G-P is not. As shown in the
>> benchmarks above, UMFPACK clearly has an overhead on these matrices.
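>>
>> If you want to confirm which code path you are hitting on your own MATLAB
>> version, one way is the sparse monitor flag (spparms is documented, though
>> how much detail it prints varies across releases). A minimal sketch, with
>> J and F as in newtonpf:
>>
>> % Sketch: ask MATLAB to report which sparse factorization it uses.
>> spparms('spumoni', 2);     % verbose sparse-matrix diagnostics
>> x1 = J \ F;                % the default path (UMFPACK for this kind of matrix)
>> [L, U, P] = lu(J, 1.0);    % three outputs + pivot threshold: the G-P path
>> x2 = U \ (L \ (P * F));
>> spparms('spumoni', 0);     % diagnostics off again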
>>
>> Coming back to the issue of GPUs, one can see why the potential
>> performance gains for this type of sparse linear algebra are not as great
>> as they seem.
>>
>> If you are interested in more details, I can provide references. See for
>> instance SuiteSparse by Tim Davis, which in terms of sparse solvers is
>> probably the best in class now (some of it is already integrated within
>> MATLAB, btw).
>>
>> --
>> Jose L. Marin
>> Grupo AIA
>>
>>
>>
>>
>> 2016-11-30 11:02 GMT+01:00 davor sutic <[email protected]>:
>>
>>> Hello Jaroslaw,
>>> I find your analysis very interesting. Would you mind sharing your
>>> performance findings for a broader range of network sizes (e.g. 118, 300,
>>> perhaps even cases of a few thousand buses)?
>>>
>>> Also, are the times you stated the mean time per iteration or the total
>>> running time of the 1000 iterations?
>>>
>>> Thanks a lot
>>> Davor
>>>
>>> On Nov 30, 2016 6:26 AM, "Jaroslaw Krata" <[email protected]> wrote:
>>>
>>>> Hello MATPOWER Team,
>>>>
>>>>
>>>> I have a quick question related to runpf calculation time.
>>>>
>>>> Some time ago I started using MATPOWER to calculate AC power flows.
>>>>
>>>> Since I need thousands of power flows, I would like to run them
>>>> efficiently.
>>>>
>>>>
>>>> I ran the MATLAB Profiler and found that a lot of the computation time
>>>> is taken up by simple algebra.
>>>>
>>>> E.g. (calculation time for 1000 runs of runpf on a 49-bus system):
>>>>
>>>> Function     Calls    Total time    Self time
>>>> runpf        1000     1.692 s       0.183 s
>>>> newtonpf     1000     0.684 s       0.552 s
>>>>
>>>>
>>>> If you take a closer look at newtonpf, 50% of the time is taken by the
>>>> "dx = -(J \ F);" operation!
>>>>
>>>> Similarly, dSbus_dV spends >60% of its time in just two lines:
>>>> "dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm;
>>>> dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);"
>>>>
>>>> These three lines take 0.420 s out of 1.692 s in total.
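>>>>
>>>> For reference, a minimal sketch of how such a profiling run can be set
>>>> up ('mycase49' is only a placeholder for the 49-bus case file):
>>>>
>>>> % Sketch: profile repeated power flow runs.
>>>> mpc  = loadcase('mycase49');    % placeholder case name
>>>> opts = mpoption('out.all', 0, 'verbose', 0);
>>>> profile on;
>>>> for i = 1:1000
>>>>     runpf(mpc, opts);
>>>> end
>>>> profile viewer;    % inspect time spent in newtonpf, dSbus_dV, etc.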
>>>>
>>>>
>>>> Have you considered moving this algebra to a CUDA GPU?
>>>>
>>>> Are there any known constraints that prevent using CUDA?
>>>>
>>>>
>>>> https://developer.nvidia.com/matlab-cuda
>>>>
>>>> https://www.youtube.com/watch?v=PwCAWxPcDkw
>>>>
>>>>
>>>>
>>>> BR,
>>>>
>>>> Jaroslaw
>>>>
>>>>
>>>>
>>>>
>>
>