Hello all, I've been meaning to report the following findings to Ray and the MATPOWER team for some time, but never got around doing it. Well, here it is in summarized form:

## Advertising

- There's a very cheap way to speedup the J \ F step (actually, the sparse LU decomposition) by factors of about x1.3 to x1.8, and maybe higher for bigger cases. It just involves invoking the MATLAB solver that is best suited for power system matrices (more below). - GPU computing, at least in its current state, is not a panacea for the powerflow problem, because the kind of sparse linear algebra that's needed in powerflow problems just doesn't offer enough opportunities to exploit GPU architectures. Try this simple change in *newtonpf.m*: marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m 89c89,93 < dx = -(J \ F); --- > %dx = -(J \ F); > % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + AMD (see T. Davis) > Q = sparse(amd(J), 1:size(J,1), 1); % AMD reordering > [L, U, P] = lu(Q'*J*Q, 1.0); % this form ensures that G-P is called. Last arg: 0==no pivoting, 1.0==partial pivoting. > dx = - Q * ( U \ ( L \ (P * Q' * F))); My benchmarks (3GHz Intel Core 2 Duo) yield the following speedups: % UMFPACK G-P speedup % ====================================================== % case300: 1.689e-02 1.230e-02 1.37 % case1354pegase: 5.848e-02 3.652e-02 1.60 % case1888rte: 2.792e-02 2.130e-02 1.31 % case1951rte: 3.037e-02 2.169e-02 1.40 % case2383wp: 1.429e-01 8.711e-02 1.64 % case2736sp: 1.327e-01 7.512e-02 1.76 % case2737sop: 1.632e-01 9.074e-02 1.79 % case2746wop: 1.589e-01 8.725e-02 1.82 % case2746wp: 1.448e-01 8.636e-02 1.67 % case2848rte: 4.652e-02 3.301e-02 1.40 % case2868rte: 1.681e-01 9.661e-02 1.74 % case2869pegase: 2.402e-01 1.335e-01 1.79 % case3012wp: 1.125e-01 6.733e-02 1.67 % case3120sp: 2.155e-01 1.228e-01 1.75 % case3375wp: 1.598e-01 9.565e-02 1.67 % case6468rte: 2.155e-01 1.443e-01 1.49 % case6470rte: 2.316e-01 1.434e-01 1.61 % case6495rte: 2.118e-01 1.433e-01 1.47 % case6515rte: 2.244e-01 1.443e-01 1.55 % case9241pegase: 1.185e+00 6.392e-01 1.85 % case13659pegase: 1.484e+00 7.991e-01 1.85 % % % max speedup: 1.85 % min speedup: 1.31 (note: I used pf.tol=1.0e-10 in order to trigger a few more iterations. I ran each case 10 times, and kept the best timing) *Longer explanation:* The Jacobian matrices that one typically obtains in power networks are *very, very* sparse. They are very different than sparse matrices obtained in other areas (such as, e.g., finite-element modeling), where the matrix still has fairly dense blocks here and there (or at least it can be re-arranged to have them). Power system matrices, as well as many electronic circuit matrices, have dense blocks of sizes 3x3 or 4x4 at most, which is rather small. Now, the vast majority of research going into sparse LU solvers focuses on the former kind of matrices, i.e. those with large dense blocks. The methods go by names like "supernodal" and "multifrontal", and basically consist in designing the algorithm so that it can exploit the BLAS for processing those dense blocks. The point is that power system matrices are ill-suited to exploit the power of the BLAS, and therefore those methods are actually slower than the more straight-forward sparse solvers that don't try to use the BLAS at all. MATLAB contains sparse solvers of both types, but the problem is that it's not clearly documented how the backslash operator (or the lu call) uses internally one algorithm or the other. It turns out that the way you call lu() influences the algorithm being used!!! The default is UMFPACK, but if you use the "three-output, two-input args" call as above then you'll be invoking a Gilbert-Peierls method. As you may have guessed by now, UMFPACK is one of those methods using the BLAS, while G-P is not. As shown in the benchmarks above, UMFPACK clearly has an overhead. Coming back to the issue of GPUs, one can see why the potential performance gains for this type of sparse linear algebra are not as great as they seem. If you are interested in more details, I can provide references. See for instance SuiteSparse by Tim Davies, which in terms of sparse solvers is probably the best in class now (some of it is already integrated within MATLAB, btw). -- Jose L. Marin Grupo AIA 2016-11-30 11:02 GMT+01:00 davor sutic <davor.su...@gmail.com>: > Hello Jaroslaw, > I find your analysis very interesting. Would you mind sharing your > performance findings about a broader range of network sizes (e.g. 118, 300, > perhaps even those of a few thousand)? > > Also, are the times you stated the mean time per iteration of the total > running times of the 1000 iterations? > > Thanks a lot > Davor > > On Nov 30, 2016 6:26 AM, "Jaroslaw Krata" <j.kr...@uq.edu.au> wrote: > >> Hello Matpower Team, >> >> >> I have a quick question related to runpf calculation time. >> >> Some time ago I started to use matpower to calculate ac power flow. >> >> Due to the fact that I need thousands of power flows I wish to run it >> effectively. >> >> >> I run Matlab Profiler and I have found out that there is a lot of >> computation time taken by simple algebra. >> >> I.e. (calculation time for 1000 iterations of runpf of 49 bus system): >> >> Function Calls Total time Self time >> runpf 1000 1.692 0.183 s >> >> newtonpf 1000 0.684 0.552 s >> >> >> If you have a closer look into newtonpf, the 50% of time is taken by "dx >> = -(J \ F);" operation! >> >> Similarly, dSbus_dV consumes >60% of time by only two lines: >> "dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm; >> dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);" >> >> These 3 lines take in total 0.420 out of 1.692 second. >> >> >> Have you considered to move this algebra to CUDA GPU? >> >> Are there any known constraints not to use CUDA? >> >> >> https://developer.nvidia.com/matlab-cuda >> >> https://www.youtube.com/watch?v=PwCAWxPcDkw >> >> >> >> BR, >> >> Jaroslaw >> >> >> >>