Here are a few comments on optimization and multicore GLPK. A search online for information about optimizing GLPK turned up only xypron's recent post about enabling SSE and fast math. GLPK is very sophisticated, so what I have to say may fall in the "not documented yet" category. Except for Maros, I've seen little mention of performance profiling and coverage analysis in the LP literature I've read; most performance metrics are limited to iteration counts and elapsed time. Serious code optimization requires a much finer granularity of analysis.
I have just completed an initial, very quick read of "Computational Techniques of the Simplex Method" by Istvan Maros. For anyone interested in working on LP codes, I think it is a necessary, but not sufficient, reference. Maros gives a good overview of many issues related to implementing a simplex code and to large-scale solvers in general. It was typeset by the author, so there are typos and non-native-speaker errors to contend with. However, all the important ideas are accompanied by numerous citations, so with the appropriate references added, it appears to provide a good map of the terrain. I qualify my statement only because I am far too ignorant of LP to make a stronger one.

I agree entirely with Andrew's earlier comments on the parallelization of GLPK. There is much work to be done on the serial algorithms before it makes any sense to attempt parallel execution. Aside from the computational cost of pricing, the particular choice of pricing rule can strongly influence solution times for particular problems, so implementing Devex and other rules seems to me a good starting point. I'm sure other opportunities to improve single-core performance exist, but this is a particularly obvious place to start.

That said, barrier-synchronized shared-memory parallel programming should be suitable for use in GLPK running on multicore workstations. Actual performance is highly dependent on cache architecture and organization, so the benefit cannot be predicted easily; considerable experimentation and deep knowledge of the hardware behavior are required to make this work well. The only reason it is worth considering at all is that a single architecture dominates the computing landscape: Intel is unlikely to do something that breaks a lot of major codes, and core counts are likely to continue to grow.

The concept is simple. In sections which are not execution-order dependent, a small number of coroutines are started by a signal. These coroutines run to completion, at which time each sends a signal to the main routine, which decrements a counter. When all the coroutines have completed, the counter reaches zero and the main routine continues. If one can avoid thrashing due to cache collisions, those portions of the code can be sped up by roughly the number of cores employed. Of course, the total speedup will be much less, as dictated by Amdahl's law. (A small pthreads-style sketch of this pattern follows below.)

In looking at some of the simplex source code, I see places where there is a loop over vector operations. The natural thing to do there is to use SSE instructions to speed up the vector operations and spread the vectors over multiple cores. (A sketch of such an inner loop also follows below.)

Though conceptually simple, none of this is easy to do. In addition, GLPK is easily the most complex code I've looked at in 20+ years of working on several million lines of scientific code. Fortunately, it's also hands down the best-written large code I've ever seen, which is a real delight. Any optimization work on a code as sophisticated as GLPK is a major undertaking which will take a long time to execute. The first step is profiling and coverage analysis.

If there is sufficient interest in this subject, I propose to implement and document performance profiling and coverage analysis of GLPK in the wikibook, using the various Netlib problem sets. For convenience, this will be restricted to the *nix environment; however, it should generally be applicable to Windows if a *nix environment package such as Cygwin is used.
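To make the barrier-synchronized idea concrete, here is a minimal fork/join sketch using POSIX threads. Everything in it is illustrative: the worker function, the slice structure, and the trivial work (doubling a vector) are invented for the example and are not taken from the GLPK sources.

    /* Minimal fork/join sketch with POSIX threads.
       Build: cc -O2 -pthread sketch.c
       All names here are hypothetical; nothing is from GLPK. */
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4
    #define N 1000000

    static double x[N];

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done = PTHREAD_COND_INITIALIZER;
    static int pending;              /* counter decremented by each worker */

    struct slice { int lo, hi; };    /* half-open range [lo, hi) */

    /* Each worker runs to completion on its slice, then signals main. */
    static void *worker(void *arg)
    {
        struct slice *s = arg;
        int i;
        for (i = s->lo; i < s->hi; i++)
            x[i] *= 2.0;             /* order-independent work */
        pthread_mutex_lock(&lock);
        if (--pending == 0)          /* last worker wakes the main routine */
            pthread_cond_signal(&done);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NWORKERS];
        struct slice s[NWORKERS];
        int i, k;
        for (i = 0; i < N; i++) x[i] = 1.0;
        pending = NWORKERS;
        for (k = 0; k < NWORKERS; k++)
        {
            s[k].lo = k * (N / NWORKERS);
            s[k].hi = (k == NWORKERS - 1) ? N : (k + 1) * (N / NWORKERS);
            pthread_create(&tid[k], NULL, worker, &s[k]);
        }
        /* main routine blocks until the counter reaches zero */
        pthread_mutex_lock(&lock);
        while (pending > 0) pthread_cond_wait(&done, &lock);
        pthread_mutex_unlock(&lock);
        for (k = 0; k < NWORKERS; k++) pthread_join(tid[k], NULL);
        printf("x[0] = %g\n", x[0]);
        return 0;
    }

Whether something like this pays off in GLPK depends entirely on finding loops that are genuinely order independent and large enough to amortize the thread start-up and synchronization cost.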
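And here is a correspondingly minimal sketch of an SSE2 inner loop of the daxpy type (y := y + a*x), representative of the vector loops mentioned above. The function name and the loop are hypothetical; I am not claiming this is how any particular GLPK routine should be rewritten.

    /* daxpy-style inner loop with SSE2 intrinsics.
       Build with gcc -msse2 (SSE2 is available by default on x86-64).
       The function name sse_daxpy is made up for this sketch. */
    #include <emmintrin.h>   /* SSE2: __m128d, _mm_*_pd */

    /* y[i] += a * x[i], two doubles per SSE register */
    static void sse_daxpy(int n, double a, const double *x, double *y)
    {
        int i;
        __m128d va = _mm_set1_pd(a);
        for (i = 0; i + 1 < n; i += 2)
        {
            __m128d vx = _mm_loadu_pd(&x[i]);
            __m128d vy = _mm_loadu_pd(&y[i]);
            vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
            _mm_storeu_pd(&y[i], vy);
        }
        for (; i < n; i++)   /* scalar remainder for odd n */
            y[i] += a * x[i];
    }

Splitting such a loop into slices and handing the slices to worker threads would then use the fork/join pattern sketched above; whether the memory traffic lets that scale is exactly the cache question raised earlier.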
I am particularly interested in comments from Andrew, Marc, Robbie, and xypron. All have been very generous to me in different ways, and this is an attempt to repay them.

I come from a seismic processing background, where run times are measured in tens of thousands of hours for a single dataset. Fortunately, most seismic codes are trivially parallel, so a few hundred quad-core nodes and a couple of weeks will get the job done. Were that not the case, oil would be a lot more expensive than it is. But we still do a lot of work to make the codes run faster.

Have Fun!
Reg
