Hello all,

There has been some recent activity aimed at reducing memory "leaks" within the Open MPI code base, including OpenRTE. These efforts are most welcome and long overdue. They have, though, prompted a couple of questions to me about why we use malloc so extensively within OpenRTE. Rather than answer these individually, I thought it might help if I documented the history for future participants.

The decision to use dynamic, as opposed to static, memory allocation as our "standard" method within OpenRTE was made at an ORTE design meeting approximately two years ago. The overarching reasons for that decision were four-fold:

1. We didn't want to introduce any system-level constraints on sizes for things like arrays or strings.

2. Given the large degree of flexibility in the system, only a small percentage of all code paths might be exercised in any given job. Static memory allocation would therefore have caused the overall memory footprint for OpenRTE to include storage for data that would likely never be used, whereas dynamic allocation would ensure we only consumed as much memory as required for that particular code path.

3. Tracking down memory corruption is generally much more difficult than plugging memory leaks. Given (1), we would either have to continually check the size of data being given to us to ensure we weren't overrunning static allocations, or we would have to spend considerable time and effort tracking down memory corruption problems. We felt that it would be more time efficient (from a development standpoint) to avoid these problems and just malloc the memory, and then use valgrind and other tools to eventually plug any resulting leaks. Every now and then we do make a pass at reducing the worst of the leakage, but no really concerted effort has been made to date, as it just hasn't been enough of a problem to merit a high priority.

4. The performance impact of using malloc was considered inconsequential to the OpenRTE functional requirements. Current measurements show that the total time to traverse a launch procedure is a few milliseconds (not including the time to send xcast stage gate messages - see my other note on scalability issues). This is well within any functional requirement expressed to date, so the use of malloc clearly hasn't created a major problem in that regard.
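
For anyone who wants to see the trade-off in (1) and (3) concretely, here is a minimal sketch in C. It is not taken from the OpenRTE source; the limit STATIC_NAME_MAX and the functions copy_name_static and copy_name_dynamic are hypothetical names made up purely for illustration. The static version has to check every input against a fixed limit (or risk an overrun), while the dynamic version accepts any length but hands the caller memory that must eventually be freed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical system-level limit, of the kind reason (1) tries to avoid. */
#define STATIC_NAME_MAX 16

/* Static approach: the caller supplies a fixed-size buffer, so the length
 * must be checked on every call. Forgetting this check is what leads to
 * memory corruption. Returns 0 on success, -1 if the input does not fit. */
static int copy_name_static(char dest[STATIC_NAME_MAX], const char *src)
{
    assert(dest != NULL && src != NULL);
    size_t len = strlen(src);
    if (len >= STATIC_NAME_MAX) {
        return -1;              /* input exceeds the fixed limit */
    }
    memcpy(dest, src, len + 1); /* copy including the terminating NUL */
    return 0;
}

/* Dynamic approach: allocate exactly what this input needs. No size limit
 * and no overrun, but the caller owns the result and must free() it;
 * forgetting to do so is a leak rather than corruption. */
static char *copy_name_dynamic(const char *src)
{
    assert(src != NULL);
    size_t len = strlen(src);
    char *copy = malloc(len + 1);
    if (copy == NULL) {
        return NULL;            /* allocation failed */
    }
    memcpy(copy, src, len + 1);
    return copy;
}
```

With these definitions, a 32-character name is rejected by copy_name_static but copied in full by copy_name_dynamic. If the caller then forgets the matching free(), the result is the kind of leak that valgrind reports directly, which is the easier failure mode to track down.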

As a result, the code contains a number of malloc/free combinations that typically involve small quantities of memory. Many of these are in debugging code that only gets called if/when a developer requests it be run - I generally ignore these, as that is not a code path that ever gets exercised during normal operations. Of the remainder, please feel free to plug leaks and/or consider alternatives. Just please keep in mind the prior considerations when making your changes.

Hope that helps.

Ralph