Hi Jose, thanks for sharing. I have one quick question: What are the units of the your benchmark (seconds, ticks, etc)?

Thanks a lot On Thu, Dec 1, 2016 at 6:52 PM, Jose Luis Marín <mari...@aia.es> wrote: > Hello all, > > I've been meaning to report the following findings to Ray and the MATPOWER > team for some time, but never got around doing it. Well, here it is in > summarized form: > > - There's a very cheap way to speedup the J \ F step (actually, the > sparse LU decomposition) by factors of about x1.3 to x1.8, and maybe higher > for bigger cases. It just involves invoking the MATLAB solver that is best > suited for power system matrices (more below). > - GPU computing, at least in its current state, is not a panacea for > the powerflow problem, because the kind of sparse linear algebra that's > needed in powerflow problems just doesn't offer enough opportunities to > exploit GPU architectures. > > > Try this simple change in *newtonpf.m*: > > marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m > 89c89,93 > < dx = -(J \ F); > --- > > %dx = -(J \ F); > > % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + > AMD (see T. Davis) > > Q = sparse(amd(J), 1:size(J,1), 1); % AMD reordering > > [L, U, P] = lu(Q'*J*Q, 1.0); % this form ensures that G-P is > called. Last arg: 0==no pivoting, 1.0==partial pivoting. > > dx = - Q * ( U \ ( L \ (P * Q' * F))); > > > My benchmarks (3GHz Intel Core 2 Duo) yield the following speedups: > > % UMFPACK G-P speedup > % ====================================================== > % case300: 1.689e-02 1.230e-02 1.37 > % case1354pegase: 5.848e-02 3.652e-02 1.60 > % case1888rte: 2.792e-02 2.130e-02 1.31 > % case1951rte: 3.037e-02 2.169e-02 1.40 > % case2383wp: 1.429e-01 8.711e-02 1.64 > % case2736sp: 1.327e-01 7.512e-02 1.76 > % case2737sop: 1.632e-01 9.074e-02 1.79 > % case2746wop: 1.589e-01 8.725e-02 1.82 > % case2746wp: 1.448e-01 8.636e-02 1.67 > % case2848rte: 4.652e-02 3.301e-02 1.40 > % case2868rte: 1.681e-01 9.661e-02 1.74 > % case2869pegase: 2.402e-01 1.335e-01 1.79 > % case3012wp: 1.125e-01 6.733e-02 1.67 > % case3120sp: 2.155e-01 1.228e-01 1.75 > % case3375wp: 1.598e-01 9.565e-02 1.67 > % case6468rte: 2.155e-01 1.443e-01 1.49 > % case6470rte: 2.316e-01 1.434e-01 1.61 > % case6495rte: 2.118e-01 1.433e-01 1.47 > % case6515rte: 2.244e-01 1.443e-01 1.55 > % case9241pegase: 1.185e+00 6.392e-01 1.85 > % case13659pegase: 1.484e+00 7.991e-01 1.85 > % > % > % max speedup: 1.85 > % min speedup: 1.31 > > (note: I used pf.tol=1.0e-10 in order to trigger a few more iterations. I > ran each case 10 times, and kept the best timing) > > > *Longer explanation:* > The Jacobian matrices that one typically obtains in power networks are *very, > very* sparse. They are very different than sparse matrices obtained in > other areas (such as, e.g., finite-element modeling), where the matrix > still has fairly dense blocks here and there (or at least it can be > re-arranged to have them). Power system matrices, as well as many > electronic circuit matrices, have dense blocks of sizes 3x3 or 4x4 at most, > which is rather small. Now, the vast majority of research going into > sparse LU solvers focuses on the former kind of matrices, i.e. those with > large dense blocks. The methods go by names like "supernodal" and > "multifrontal", and basically consist in designing the algorithm so that it > can exploit the BLAS for processing those dense blocks. The point is that > power system matrices are ill-suited to exploit the power of the BLAS, and > therefore those methods are actually slower than the more straight-forward > sparse solvers that don't try to use the BLAS at all. > > MATLAB contains sparse solvers of both types, but the problem is that it's > not clearly documented how the backslash operator (or the lu call) uses > internally one algorithm or the other. It turns out that the way you call > lu() influences the algorithm being used!!! The default is UMFPACK, but if > you use the "three-output, two-input args" call as above then you'll be > invoking a Gilbert-Peierls method. As you may have guessed by now, UMFPACK > is one of those methods using the BLAS, while G-P is not. As shown in the > benchmarks above, UMFPACK clearly has an overhead. > > Coming back to the issue of GPUs, one can see why the potential > performance gains for this type of sparse linear algebra are not as great > as they seem. > > If you are interested in more details, I can provide references. See for > instance SuiteSparse by Tim Davies, which in terms of sparse solvers is > probably the best in class now (some of it is already integrated within > MATLAB, btw). > > -- > Jose L. Marin > Grupo AIA > > > > > 2016-11-30 11:02 GMT+01:00 davor sutic <davor.su...@gmail.com>: > >> Hello Jaroslaw, >> I find your analysis very interesting. Would you mind sharing your >> performance findings about a broader range of network sizes (e.g. 118, 300, >> perhaps even those of a few thousand)? >> >> Also, are the times you stated the mean time per iteration of the total >> running times of the 1000 iterations? >> >> Thanks a lot >> Davor >> >> On Nov 30, 2016 6:26 AM, "Jaroslaw Krata" <j.kr...@uq.edu.au> wrote: >> >>> Hello Matpower Team, >>> >>> >>> I have a quick question related to runpf calculation time. >>> >>> Some time ago I started to use matpower to calculate ac power flow. >>> >>> Due to the fact that I need thousands of power flows I wish to run it >>> effectively. >>> >>> >>> I run Matlab Profiler and I have found out that there is a lot of >>> computation time taken by simple algebra. >>> >>> I.e. (calculation time for 1000 iterations of runpf of 49 bus system): >>> >>> Function Calls Total time Self time >>> runpf 1000 1.692 0.183 s >>> >>> newtonpf 1000 0.684 0.552 s >>> >>> >>> If you have a closer look into newtonpf, the 50% of time is taken by "dx >>> = -(J \ F);" operation! >>> >>> Similarly, dSbus_dV consumes >60% of time by only two lines: >>> "dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * >>> diagVnorm; >>> dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);" >>> >>> These 3 lines take in total 0.420 out of 1.692 second. >>> >>> >>> Have you considered to move this algebra to CUDA GPU? >>> >>> Are there any known constraints not to use CUDA? >>> >>> >>> https://developer.nvidia.com/matlab-cuda >>> >>> https://www.youtube.com/watch?v=PwCAWxPcDkw >>> >>> >>> >>> BR, >>> >>> Jaroslaw >>> >>> >>> >>> >