Oops, sorry, they're *seconds* (for one runpf() run). Here's the benchmarking script; it's pretty simple:

case_list = {'case300', 'case1354pegase', 'case1888rte', 'case1951rte', ...
    'case2383wp', 'case2736sp', 'case2737sop', 'case2746wop', ...
    'case2746wp', 'case2848rte', 'case2868rte', 'case2869pegase', ...
    'case3012wp', 'case3120sp', 'case3375wp', ...
    'case6468rte', 'case6470rte', 'case6495rte', 'case6515rte', ...
    'case9241pegase', 'case13659pegase'};

opts = mpoption('pf.alg', 'NR', 'pf.tol', 1.0e-10, 'pf.nr.max_it', 100, ...
    'out.all', 0, 'verbose', 0);

NREPEAT = 10;
timings = zeros(NREPEAT, 1);
for k = 1:length(case_list)
    cfile = char(case_list(k));
    fprintf('*** BENCHMARKING %s:\t', cfile);
    mpc = loadcase(cfile);
    for i = 1:NREPEAT
        tic;
        mpsol = runpf(mpc, opts);
        timings(i) = toc;
    end
    fprintf('%11.3e\n', min(timings));
end

--
Jose L. Marin
Grupo AIA

2016-12-01 19:01 GMT+01:00 davor sutic <davor.su...@gmail.com>:

> Hi Jose,
> thanks for sharing. I have one quick question: what are the units of
> your benchmark (seconds, ticks, etc.)?
>
> Thanks a lot
>
> On Thu, Dec 1, 2016 at 6:52 PM, Jose Luis Marín <mari...@aia.es> wrote:
>
>> Hello all,
>>
>> I've been meaning to report the following findings to Ray and the
>> MATPOWER team for some time, but never got around to doing it. Well,
>> here it is in summarized form:
>>
>> - There's a very cheap way to speed up the J \ F step (actually, the
>>   sparse LU decomposition) by factors of about x1.3 to x1.8, and maybe
>>   higher for bigger cases. It just involves invoking the MATLAB solver
>>   that is best suited for power-system matrices (more below).
>> - GPU computing, at least in its current state, is not a panacea for
>>   the power flow problem, because the kind of sparse linear algebra
>>   that's needed in power flow problems just doesn't offer enough
>>   opportunities to exploit GPU architectures.
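[For readers following along outside MATLAB: SciPy exposes the same "pick the fill-reducing ordering explicitly" knob discussed in this thread through `scipy.sparse.linalg.splu` and its `permc_spec` argument. A minimal sketch, assuming only a made-up random sparse matrix as a stand-in for a power-flow Jacobian — this is an illustrative analogue, not MATPOWER code:]

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hypothetical stand-in for a power-system Jacobian: a very sparse,
# diagonally dominant random matrix (real Jacobians come from newtonpf).
rng = np.random.default_rng(0)
n = 500
A = sp.random(n, n, density=0.01, random_state=rng, format="csc")
A = (A + 10.0 * sp.eye(n)).tocsc()   # shift the diagonal so A is nonsingular
F = rng.standard_normal(n)

# splu lets you choose the column ordering explicitly; different orderings
# change fill-in (and thus speed), but never the solution itself.
lu_colamd = spla.splu(A, permc_spec="COLAMD")
lu_amd    = spla.splu(A, permc_spec="MMD_AT_PLUS_A")  # symmetric-pattern-style ordering

dx1 = -lu_colamd.solve(F)   # the Newton step dx = -(J \ F), SciPy style
dx2 = -lu_amd.solve(F)

# Both factorizations solve the same system.
assert np.allclose(dx1, dx2)
assert np.allclose(A @ dx1, -F)
```

The point mirrors the MATLAB discussion below: the ordering/algorithm choice is a performance knob only, so it can be swapped freely once you know which variant suits your matrix structure.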
>>
>> Try this simple change in *newtonpf.m*:
>>
>> marinjl@xwing:~/Library/matpower6.0b2$ diff newtonpf.m.ORIG newtonpf.m
>> 89c89,93
>> < dx = -(J \ F);
>> ---
>> > %dx = -(J \ F);
>> > % JLM: For performance, factorize using MATLAB's Gilbert-Peierls + AMD (see T. Davis)
>> > Q = sparse(amd(J), 1:size(J,1), 1);  % AMD reordering
>> > [L, U, P] = lu(Q'*J*Q, 1.0);  % this form ensures that G-P is called. Last arg: 0 == no pivoting, 1.0 == partial pivoting.
>> > dx = - Q * ( U \ ( L \ (P * Q' * F)));
>>
>> My benchmarks (3 GHz Intel Core 2 Duo) yield the following speedups:
>>
>> %                     UMFPACK     G-P         speedup
>> % ====================================================
>> % case300:           1.689e-02   1.230e-02    1.37
>> % case1354pegase:    5.848e-02   3.652e-02    1.60
>> % case1888rte:       2.792e-02   2.130e-02    1.31
>> % case1951rte:       3.037e-02   2.169e-02    1.40
>> % case2383wp:        1.429e-01   8.711e-02    1.64
>> % case2736sp:        1.327e-01   7.512e-02    1.76
>> % case2737sop:       1.632e-01   9.074e-02    1.79
>> % case2746wop:       1.589e-01   8.725e-02    1.82
>> % case2746wp:        1.448e-01   8.636e-02    1.67
>> % case2848rte:       4.652e-02   3.301e-02    1.40
>> % case2868rte:       1.681e-01   9.661e-02    1.74
>> % case2869pegase:    2.402e-01   1.335e-01    1.79
>> % case3012wp:        1.125e-01   6.733e-02    1.67
>> % case3120sp:        2.155e-01   1.228e-01    1.75
>> % case3375wp:        1.598e-01   9.565e-02    1.67
>> % case6468rte:       2.155e-01   1.443e-01    1.49
>> % case6470rte:       2.316e-01   1.434e-01    1.61
>> % case6495rte:       2.118e-01   1.433e-01    1.47
>> % case6515rte:       2.244e-01   1.443e-01    1.55
>> % case9241pegase:    1.185e+00   6.392e-01    1.85
>> % case13659pegase:   1.484e+00   7.991e-01    1.85
>> %
>> % max speedup: 1.85
>> % min speedup: 1.31
>>
>> (Note: I used pf.tol = 1.0e-10 in order to trigger a few more
>> iterations. I ran each case 10 times and kept the best timing.)
>>
>> *Longer explanation:*
>> The Jacobian matrices that one typically obtains in power networks are
>> *very, very* sparse.
>> They are very different from the sparse matrices obtained in other
>> areas (such as finite-element modeling), where the matrix still has
>> fairly dense blocks here and there (or at least can be re-arranged to
>> have them). Power-system matrices, as well as many electronic-circuit
>> matrices, have dense blocks of sizes 3x3 or 4x4 at most, which is
>> rather small. Now, the vast majority of research going into sparse LU
>> solvers focuses on the former kind of matrices, i.e. those with large
>> dense blocks. The methods go by names like "supernodal" and
>> "multifrontal", and basically consist in designing the algorithm so
>> that it can exploit the BLAS for processing those dense blocks. The
>> point is that power-system matrices are ill-suited to exploiting the
>> power of the BLAS, and therefore those methods are actually slower
>> than the more straightforward sparse solvers that don't try to use the
>> BLAS at all.
>>
>> MATLAB contains sparse solvers of both types, but the problem is that
>> it's not clearly documented how the backslash operator (or the lu
>> call) internally chooses one algorithm or the other. It turns out that
>> the way you call lu() influences the algorithm being used! The default
>> is UMFPACK, but if you use the "three-output, two-input args" call as
>> above, then you'll be invoking a Gilbert-Peierls method. As you may
>> have guessed by now, UMFPACK is one of those methods using the BLAS,
>> while G-P is not. As shown in the benchmarks above, UMFPACK clearly
>> has an overhead on these matrices.
>>
>> Coming back to the issue of GPUs, one can now see why the potential
>> performance gains for this type of sparse linear algebra are not as
>> great as they seem.
>>
>> If you are interested in more details, I can provide references. See,
>> for instance, SuiteSparse by Tim Davis, which in terms of sparse
>> solvers is probably the best in class now (some of it is already
>> integrated within MATLAB, btw).
>>
>> --
>> Jose L.
>> Marin
>> Grupo AIA
>>
>> 2016-11-30 11:02 GMT+01:00 davor sutic <davor.su...@gmail.com>:
>>
>>> Hello Jaroslaw,
>>> I find your analysis very interesting. Would you mind sharing your
>>> performance findings for a broader range of network sizes (e.g. 118,
>>> 300, perhaps even a few thousand buses)?
>>>
>>> Also, are the times you stated the mean time per iteration or the
>>> total running times of the 1000 iterations?
>>>
>>> Thanks a lot
>>> Davor
>>>
>>> On Nov 30, 2016 6:26 AM, "Jaroslaw Krata" <j.kr...@uq.edu.au> wrote:
>>>
>>>> Hello Matpower Team,
>>>>
>>>> I have a quick question related to runpf calculation time.
>>>>
>>>> Some time ago I started to use MATPOWER to calculate AC power flow.
>>>> Since I need thousands of power flows, I wish to run it efficiently.
>>>>
>>>> I ran the MATLAB Profiler and found that a lot of the computation
>>>> time is taken by simple algebra. For example (calculation times for
>>>> 1000 iterations of runpf on a 49-bus system):
>>>>
>>>>   Function    Calls   Total time   Self time
>>>>   runpf       1000    1.692 s      0.183 s
>>>>   newtonpf    1000    0.684 s      0.552 s
>>>>
>>>> If you have a closer look into newtonpf, 50% of its time is taken by
>>>> the "dx = -(J \ F);" operation!
>>>>
>>>> Similarly, dSbus_dV spends >60% of its time in only two lines:
>>>> "dSbus_dVm = diagV * conj(Ybus * diagVnorm) + conj(diagIbus) * diagVnorm;
>>>>  dSbus_dVa = 1j * diagV * conj(diagIbus - Ybus * diagV);"
>>>>
>>>> These 3 lines take 0.420 out of 1.692 seconds in total.
>>>>
>>>> Have you considered moving this algebra to a CUDA GPU?
>>>> Are there any known constraints that would rule out CUDA?
>>>>
>>>> https://developer.nvidia.com/matlab-cuda
>>>> https://www.youtube.com/watch?v=PwCAWxPcDkw
>>>>
>>>> BR,
>>>> Jaroslaw
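[The two dSbus_dV lines Jaroslaw quotes can be checked in isolation. Below is a dense NumPy transcription of those formulas — a sketch using a made-up 2-bus Ybus, not the MATPOWER sparse implementation — verified against central finite differences of S(V) = diag(V) * conj(Ybus * V):]

```python
import numpy as np

# Hypothetical 2-bus admittance matrix and operating point, for illustration only.
Ybus = np.array([[ 4.0 - 10.0j, -4.0 + 10.0j],
                 [-4.0 + 10.0j,  4.0 - 10.0j]])
Vm = np.array([1.02, 0.98])        # voltage magnitudes
Va = np.array([0.0, -0.05])        # voltage angles (rad)
V  = Vm * np.exp(1j * Va)

def dSbus_dV(Ybus, V):
    """Dense transcription of the two MATPOWER lines quoted in the thread."""
    Ibus      = Ybus @ V
    diagV     = np.diag(V)
    diagIbus  = np.diag(Ibus)
    diagVnorm = np.diag(V / np.abs(V))
    dSbus_dVm = diagV @ np.conj(Ybus @ diagVnorm) + np.conj(diagIbus) @ diagVnorm
    dSbus_dVa = 1j * diagV @ np.conj(diagIbus - Ybus @ diagV)
    return dSbus_dVm, dSbus_dVa

dSm, dSa = dSbus_dV(Ybus, V)

def Sbus(Vm, Va):
    """Complex bus power injections S(Vm, Va) = diag(V) * conj(Ybus * V)."""
    V = Vm * np.exp(1j * Va)
    return V * np.conj(Ybus @ V)

# Central finite differences must match the analytic Jacobian columns.
h = 1e-7
for k in range(2):
    dVm = np.zeros(2); dVm[k] = h
    fd = (Sbus(Vm + dVm, Va) - Sbus(Vm - dVm, Va)) / (2 * h)
    assert np.allclose(dSm[:, k], fd, atol=1e-6)
    dVa = np.zeros(2); dVa[k] = h
    fd = (Sbus(Vm, Va + dVa) - Sbus(Vm, Va - dVa)) / (2 * h)
    assert np.allclose(dSa[:, k], fd, atol=1e-6)
```

The formulas themselves are exactly what the profiler flagged as hot; the expense in MATPOWER comes from building them as large sparse matrix products, not from any algebraic complexity.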