Dear Rohan,

Here is my current status.
When I tested DMTCP, I was only able to reach 768 cores. I was using DMTCP 2.3.1 with Open MPI 1.6.5 + GCC 4.7.0, with MPI communication over TCP. I upgraded to DMTCP 2.4.0 with Open MPI 1.6.5 (TCP) + GCC 4.7.0, but I was still limited to 768 cores. With DMTCP 2.4.0, Open MPI 1.8.1 (TCP) and GCC 4.7.4, I was able to reach 1024 cores for all four of my applications, and 2048 cores for one of the four.

Here are the different error outputs that I got:

========================== output error 1 ======================================
[a0022][[58558,1],1006][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1007][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1010][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1017][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1018][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1021][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1022][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],982][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],999][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
===========================================================================

========================== output error 2 ======================================
367, mpiblast[39406000:46565]@a0030.mogon, 22f5c423be81cb78-39406000-55e46e22, DRAINED
39368, mpiblast[39407000:59770]@a0332.mogon, 6e3052d5e9f40313-39407000-55e46e22, DRAINED
39369, mpiblast[39408000:35959]@a0498.mogon, 7a3a4d2715b716aa-39408000-55e46e22, DRAINED
39370, mpiblast[39409000:17255]@a0330.mogon, 6e39ab0a063f8865-39409000-55e46e22, DRAINED
39371, mpiblast[39410000:60763]@a0187.mogon, 10e5f03c5611ca07-39410000-55e46e22, DRAINED
39372, mpiblast[39411000:5314]@a0083.mogon, 413864e35f108602-39411000-55e46e22, DRAINED
39373, mpiblast[39412000:62007]@a0069.mogon, 353f621f1e4db4f2-39412000-55e46e22, DRAINED
39374, mpiblast[39413000:34154]@a0501.mogon, 1ee125e0a4429b7d-39413000-55e46e22, DRAINED
39375, mpiblast[39414000:60770]@a0022.mogon, 1cf496a78ffaa047-39414000-55e46e22, DRAINED
[38856000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 11
     buffer.size() = 0
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?
]
===========================================================================

========================== output error 3 ======================================
[24851] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint; REASON='Incremented Generation'
     compId.generation() = 2
[24851] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState; REASON='locking all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState; REASON='draining all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 2af185272bc239da-2338000-55e47828
[24851] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState; REASON='checkpointing all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState; REASON='building name service database'
[24851] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState; REASON='entertaining queries now'
[24851] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState; REASON='refilling all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState; REASON='restarting all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 353f621f1e4db4f2-4153000-55e4782b
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 25aa6f39995effba-2440000-55e47828
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 1f9fe9894e8c4f37-2601000-55e47828
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect;
===========================================================================

Best regards,

Ramy Gad
Johannes Gutenberg - Universität Mainz
Zentrums für Datenverarbeitung (ZDV)
Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
E-Mail: g...@uni-mainz.de
Office Phone: +49-6131-39-26437

________________________________________
From: Gad, Ramy <g...@uni-mainz.de>
Sent: Tuesday, August 18, 2015 11:26 AM
To: Rohan Garg
Cc: Nagel, Lars; dmtcp-forum@lists.sourceforge.net; Süß, Dr. Tim
Subject: Re: [Dmtcp-forum] DMTCP scaling potential

Dear Rohan,

Thank you for your reply. I will have a look at the publication. I will collect detailed information about my setup and get back to you as soon as possible.

Best regards,

Ramy Gad
Johannes Gutenberg - Universität Mainz
Zentrums für Datenverarbeitung (ZDV)
Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
E-Mail: g...@uni-mainz.de
Office Phone: +49-6131-39-26437

________________________________________
From: Rohan Garg <rohg...@ccs.neu.edu>
Sent: Monday, August 17, 2015 5:09 PM
To: Gad, Ramy
Cc: dmtcp-forum@lists.sourceforge.net; Süß, Dr. Tim; Nagel, Lars
Subject: Re: [Dmtcp-forum] DMTCP scaling potential

Hi Ramy,

In the past we have tested with up to 2K cores. The results were published in HPDC-2014 [1]. We are currently doing scalability tests at Stampede [2], and have not noticed any issues up to 4K cores.

The inability to scale beyond 768 cores could be a bug in DMTCP, or some configuration issue. My best guess (looking at the number 768) would be that there is a limit on the number of open file descriptors per process on the node where your coordinator is running.

Could you give us more details of your setup? In particular, it’ll be helpful to know the following details:

- DMTCP version
- MPI library
- Resource manager
- Linux kernel version
- Process limits (Try: ulimit -a)

If it helps, we’d be happy to assist you in setting up your environment.

[1]: http://www.ccs.neu.edu/home/gene/papers/hpdc14.pdf
[2]: https://www.tacc.utexas.edu/stampede/

Thanks,
Rohan

> On Aug 17, 2015, at 4:48 AM, Gad, Ramy <g...@uni-mainz.de> wrote:
>
> Hi,
>
> We have used DMTCP to checkpoint several MPI applications, for example
> mpiBLAST, Ray, PhyloBayes, and NAMD.
> However, we were not able to scale beyond 768 cores.
>
> My questions are:
>
> Is there a limitation on the maximum scaling potential of DMTCP?
>
> Has anyone done any scaling tests?
> If so, are those results publicly available?
>
> Can we scale beyond 1K cores with DMTCP?
>
> Best regards,
>
> Ramy Gad
> Johannes Gutenberg - Universität Mainz
> Zentrums für Datenverarbeitung (ZDV)
>
> Anselm-Franz-von-Bentzel-Weg 12
> 55128 Mainz
> Germany
> E-Mail: g...@uni-mainz.de
> Office Phone: +49-6131-39-26437
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum

------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
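[Editor's note: Rohan's hypothesis — a per-process limit on open file descriptors on the coordinator node — is quick to test before the next large run. A minimal sketch follows; the 2048-rank target is a placeholder for the actual job size, and the one-fd-per-client estimate is an assumption based on the coordinator keeping one socket per connected DMTCP client.]

```shell
#!/bin/sh
# Check the per-process open-file limits on the node that will run
# dmtcp_coordinator. If the coordinator holds roughly one socket per
# client, a 2048-rank job needs the soft limit to exceed 2048.
ranks=2048              # placeholder: target number of MPI ranks
soft=$(ulimit -Sn)      # current soft limit on open file descriptors
hard=$(ulimit -Hn)      # hard ceiling the soft limit may be raised to

echo "open-fd limits: soft=$soft hard=$hard (need > $ranks)"

if [ "$soft" -lt "$ranks" ]; then
    # The soft limit can be raised up to the hard limit without root,
    # e.g. in the job script before starting the coordinator:
    echo "soft limit $soft is below $ranks; try: ulimit -Sn $hard"
fi
```

If the soft limit turns out to be near 1024 on the coordinator node, that would line up with the observed ceiling once the coordinator's own stdio and listening sockets are subtracted.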