Thanks to all for the links to papers. A parallel implementation of either the primal or dual simplex method is a difficult research topic, and not something I have, or ever had, any motivation to undertake.
I'm interested in optimizing the existing serial revised simplex algorithm, with particular attention to low-effort performance improvements. This leads to a focus on the matrix multiply.

Row-column operations are trivially data parallel. One can spread the work over multiple cores by using modulo arithmetic to select the rows processed by each core (see the first sketch below). On a shared memory machine the overhead of doing this is very low, and it does not require making GLPK thread-safe. I've not counted how many functions would be affected, but the number is very small and the changes are relatively simple.

For dense rows the operation should be vectorized so that the SSE extensions can be used to full effect (second sketch below). The effort required is modest; as I recall, padding the arrays will suffice, though compilers and hardware may take care of that now.

For hyper-sparse rows a sparse multiply algorithm will outperform the vectorized version, so some logic is required to keep track of row density and select which of the two methods is used (third sketch below).

The sort of multicore operation I'm suggesting will eventually be done automatically by the compiler. At present that kind of scheduling is done within a single core to keep multiple FPUs busy. It may well already be available in some of the commercial HPC compilers, though with some restrictions.

Attached processors such as GPUs have been popular at various times. While they offer significant speedups, that comes at a price. Whether they are worthwhile depends on the ratio of communication time to computation time, and they often demand significant code changes to use effectively. Historically, the CPU has caught up and the labor of coding to the attached processor has been rendered useless or even harmful.

If the code accounting for 40% of the execution time can be sped up by a factor of 4x, the total improvement will be at most a 30% reduction in execution time (the arithmetic is worked out below). I'd be delighted if the combination of vectorization, sparse arithmetic, and multicore operation did better than that, but it's very hard to make large performance improvements in good codes. Most of the time, improvements to one aspect of a code just expose a bottleneck somewhere else. For large LP problems I would expect cache and main-memory behavior to limit the gains from faster arithmetic.
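To make the modulo idea concrete, here is a minimal sketch in C with OpenMP. The dense row-major matrix and the function name are made up for illustration; this is not GLPK's actual data structure or code:

    /* Sketch: distribute the rows of y = A*x over cores by modulo
       arithmetic.  Starting at 'core' and stepping by 'ncore' gives
       exactly the rows with i % ncore == core. */
    #include <stddef.h>
    #include <omp.h>

    void mat_vec_modulo(int m, int n,
                        const double *A,   /* m x n, row-major */
                        const double *x, double *y)
    {
        #pragma omp parallel
        {
            int core  = omp_get_thread_num();
            int ncore = omp_get_num_threads();
            for (int i = core; i < m; i += ncore) {
                double s = 0.0;
                for (int j = 0; j < n; j++)
                    s += A[(size_t)i * n + j] * x[j];
                y[i] = s;  /* each core owns disjoint rows: no locks,
                              and GLPK itself need not be thread-safe */
            }
        }
    }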
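A minimal sketch of the vectorized dense case, assuming SSE2 and arrays that are 16-byte aligned and zero-padded to an even number of doubles (the padding mentioned above):

    /* Sketch: SSE2 dot product of a dense row with a dense vector. */
    #include <emmintrin.h>

    double dot_sse(const double *row, const double *x, int n_padded)
    {
        __m128d acc = _mm_setzero_pd();
        for (int j = 0; j < n_padded; j += 2) {
            __m128d a = _mm_load_pd(row + j);  /* two doubles of the row */
            __m128d b = _mm_load_pd(x + j);    /* two doubles of x */
            acc = _mm_add_pd(acc, _mm_mul_pd(a, b));
        }
        /* horizontal add of the two partial sums */
        double tmp[2];
        _mm_storeu_pd(tmp, acc);
        return tmp[0] + tmp[1];
    }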
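And a sketch of the sparse kernel together with the selection logic. The 10% density threshold is a number I made up for illustration, and dot_sse is the kernel from the previous sketch:

    /* Sketch: sparse dot product for hyper-sparse rows.  ind[]/val[]
       hold the positions and values of the row's nnz nonzeros. */
    double dot_sparse(int nnz, const int *ind, const double *val,
                      const double *x)
    {
        double s = 0.0;
        for (int k = 0; k < nnz; k++)
            s += val[k] * x[ind[k]];
        return s;
    }

    /* Dispatch on row density; the threshold would need tuning.
       Assumes the row is also available in padded dense form, with
       n already rounded up to an even count for dot_sse. */
    double dot_row(int n, int nnz, const int *ind, const double *val,
                   const double *row_dense, const double *x)
    {
        if (nnz < n / 10)
            return dot_sparse(nnz, ind, val, x);
        else
            return dot_sse(row_dense, x, n);
    }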
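For the record, the arithmetic behind that 30% figure is just Amdahl's law:

    remaining time   = 0.60 + 0.40 / 4 = 0.70 of the original
    overall speedup  = 1 / 0.70 ~ 1.43x
    i.e. a 30% reduction in execution time, at best.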
Have Fun!

Reg