Re: [Dmtcp-forum] Performance of Restarted MPI applications under DMTCP

Maksym Planeta Thu, 01 Sep 2022 10:55:16 -0700

Hi Cyrill,

I remember having a similar-looking problem with CRIU [1]. Try taking a look 
at the performance counters.



[1] https://github.com/checkpoint-restore/criu/issues/1171

On 8/29/22 09:47, Cyrill Burth wrote:

Hi,
Over the last few weeks I have been working with DMTCP and running some performance benchmarks. For this I used the NPB 3.4.2 BT-MPI benchmark [1] on the Taurus supercomputer at TU Dresden, always with 16 MPI ranks and gzip disabled.
I noticed that an application restarted from its checkpoint runs (drastically) slower than it did before the checkpoint. I will refer to this phenomenon as the "restart penalty".
I will briefly describe my methodology: I took a checkpoint in the 20th iteration. The time measured from the 21st to the last iteration before restart was 25% to 45% less than the time for the same iterations after restarting from the checkpoint taken in the 20th iteration. I verified this with the MPI benchmark (25%-45% "restart penalty") as well as with the OpenMP benchmark (a consistent 15% "restart penalty"), which is also provided by NPB under [1]. I ran all tests multiple times on multiple nodes, and all of them yielded the same results. To compile and run the benchmark I used the intel/2019b toolchain, since I had some compatibility issues with newer versions. I repeated the tests with application-initiated checkpointing as well as with the "-i" option, without modifying the benchmark's source code. Both yielded the same results.
However, the reason I am contacting you is that I have noticed not only the behavior described above, but also that the "restart penalty" seems to scale with the speed of the filesystem used, at least when using MPI. When I restarted from our relatively slow local SSDs, I saw a "restart penalty" of roughly 45%; when I restarted the same checkpoint from a RAM disk, I saw a "restart penalty" of only 25%. This could only be observed with the MPI version of the benchmark. For the OpenMP version there was a "restart penalty" of 15%, but it did not scale with the filesystem used.
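To make the metric concrete, here is a minimal sketch of how the "restart penalty" described above could be computed. The function name and the sample numbers are hypothetical and are not measured values. It assumes the penalty is expressed relative to the post-restart time, which follows the "X% less than" wording; a definition relative to the pre-restart time would give larger percentages.

```python
def restart_penalty(time_before: float, time_after: float) -> float:
    """Fraction by which the pre-restart time is lower than the post-restart time.

    time_before: wall time for iterations 21..last in the uninterrupted run.
    time_after:  wall time for the same iterations after restarting.
    """
    if time_after <= 0:
        raise ValueError("time_after must be positive")
    return 1.0 - time_before / time_after


# Hypothetical timings in seconds, for illustration only.
baseline = 75.0
restarted = 100.0
print(f"restart penalty: {restart_penalty(baseline, restarted):.0%}")
```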
I was wondering if anyone could give me any insights that could explain this 
behavior.
The restart times themselves obviously go up when the slower filesystem is used, which was to be expected. However, it appears rather odd that the performance after restart depends on the filesystem used for the restart. Some further investigation showed that every single iteration of the benchmark gets slowed down. It is *not* the case that some iterations take significantly longer than others. No further checkpoints were taken except for the very first one in the 20th iteration, from which I restarted and which was excluded from the time measurements.
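The per-iteration observation can be checked along the following lines. This is only a sketch: the function names, the 5% tolerance, and the sample timings are assumptions chosen for illustration, not part of the actual measurements.

```python
from statistics import mean


def slowdown_ratios(before, after):
    """Per-iteration ratio of post-restart time to pre-restart time."""
    if len(before) != len(after):
        raise ValueError("both runs must cover the same iterations")
    return [a / b for b, a in zip(before, after)]


def is_uniform(ratios, tolerance=0.05):
    """True if every ratio lies within `tolerance` (relative) of the mean ratio."""
    m = mean(ratios)
    return all(abs(r - m) <= tolerance * m for r in ratios)


# Hypothetical per-iteration times in seconds, for illustration only.
before = [1.00, 1.00, 1.00, 1.00]
after = [1.30, 1.31, 1.29, 1.30]
ratios = slowdown_ratios(before, after)
print("uniform slowdown:", is_uniform(ratios))
```

A uniform ratio across iterations points to a steady cost per iteration, whereas a few outliers would suggest one-off events such as page faults on first touch.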
Thank you very much in advance.


Best regards,

C. Burth


[1] https://www.nas.nasa.gov/software/npb.html



_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum


--
Regards,
Maksym Planeta

