Dear Rohan,

This is my current status.

When I tested DMTCP, I was only able to reach 768 cores. I was using DMTCP 
version 2.3.1 with Open MPI 1.6.5 + GCC 4.7.0, and TCP for MPI communication.

I tried upgrading to DMTCP version 2.4.0 with Open MPI 1.6.5 (TCP) + GCC 
4.7.0, but I was still limited to 768 cores.

With DMTCP 2.4.0 and Open MPI 1.8.1 (TCP) + GCC 4.7.4, I was able to reach 
1024 cores for all four of my applications, and 2048 cores for only one of 
the four.

Here are the different error outputs that I got:

========================== output error 1 ======================================
[a0022][[58558,1],1006][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1007][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1010][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1017][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1018][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1021][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],1022][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],982][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[a0022][[58558,1],999][../../../../../../openmpi-1.8.1/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
===========================================================================

========================== output error 2 ======================================

39367, mpiblast[39406000:46565]@a0030.mogon, 22f5c423be81cb78-39406000-55e46e22, DRAINED
39368, mpiblast[39407000:59770]@a0332.mogon, 6e3052d5e9f40313-39407000-55e46e22, DRAINED
39369, mpiblast[39408000:35959]@a0498.mogon, 7a3a4d2715b716aa-39408000-55e46e22, DRAINED
39370, mpiblast[39409000:17255]@a0330.mogon, 6e39ab0a063f8865-39409000-55e46e22, DRAINED
39371, mpiblast[39410000:60763]@a0187.mogon, 10e5f03c5611ca07-39410000-55e46e22, DRAINED
39372, mpiblast[39411000:5314]@a0083.mogon, 413864e35f108602-39411000-55e46e22, DRAINED
39373, mpiblast[39412000:62007]@a0069.mogon, 353f621f1e4db4f2-39412000-55e46e22, DRAINED
39374, mpiblast[39413000:34154]@a0501.mogon, 1ee125e0a4429b7d-39413000-55e46e22, DRAINED
39375, mpiblast[39414000:60770]@a0022.mogon, 1cf496a78ffaa047-39414000-55e46e22, DRAINED
[38856000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 11
     buffer.size() = 0
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?

]
===========================================================================

========================== output error 3 ======================================

[24851] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint; REASON='Incremented Generation'
     compId.generation() = 2
[24851] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState; REASON='locking all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState; REASON='draining all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 2af185272bc239da-2338000-55e47828
[24851] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState; REASON='checkpointing all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState; REASON='building name service database'
[24851] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState; REASON='entertaining queries now'
[24851] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState; REASON='refilling all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState; REASON='restarting all nodes'
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 353f621f1e4db4f2-4153000-55e4782b
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 25aa6f39995effba-2440000-55e47828
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
     client->identity() = 1f9fe9894e8c4f37-2601000-55e47828
[24851] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect;
===========================================================================


Best regards,


Ramy Gad
Johannes Gutenberg - Universität Mainz
Zentrum für Datenverarbeitung (ZDV)

Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
E-Mail: g...@uni-mainz.de
Office Phone: +49-6131-39-26437


________________________________________
From: Gad, Ramy <g...@uni-mainz.de>
Sent: Tuesday, August 18, 2015 11:26 AM
To: Rohan Garg
Cc: Nagel, Lars; dmtcp-forum@lists.sourceforge.net; Süß, Dr. Tim
Subject: Re: [Dmtcp-forum] DMTCP scaling potential

Dear Rohan,

Thank you for your reply.

I will have a look at the publication.

I will collect detailed information about my setup and get back to you as soon 
as possible.

Best regards,

Ramy Gad
Johannes Gutenberg - Universität Mainz
Zentrum für Datenverarbeitung (ZDV)

Anselm-Franz-von-Bentzel-Weg 12
55128 Mainz
Germany
E-Mail: g...@uni-mainz.de
Office Phone: +49-6131-39-26437


________________________________________
From: Rohan Garg <rohg...@ccs.neu.edu>
Sent: Monday, August 17, 2015 5:09 PM
To: Gad, Ramy
Cc: dmtcp-forum@lists.sourceforge.net; Süß, Dr. Tim; Nagel, Lars
Subject: Re: [Dmtcp-forum] DMTCP scaling potential

Hi Ramy,

In the past we have tested with up to 2K cores. The results were
published in HPDC-2014 [1]. We are currently doing scalability
tests at Stampede [2], and have not noticed any issues up to
4K cores.

The inability to scale beyond 768 cores could be a bug in DMTCP,
or some configuration issue. My best guess (looking at the number 768)
would be that there is a limit on the number of open file descriptors per
process on the node where your coordinator is running.
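
If you want to check this programmatically, here is a minimal sketch in Python
(it assumes the coordinator keeps one TCP socket per connected DMTCP client, so
the open-file limit on that node has to exceed your rank count):

    import resource

    # Per-process open-file limit on the coordinator node; one socket is
    # held per connected client (assumption), plus a few for stdio/logs.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("open-file limit: soft=%d, hard=%d" % (soft, hard))

    # Raise the soft limit up to the hard limit if it is lower; going past
    # the hard limit needs root (e.g. /etc/security/limits.conf).
    if soft < hard and hard != resource.RLIM_INFINITY:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        print("raised soft limit to %d" % hard)

The same check can of course be done with "ulimit -n" in the shell where the
coordinator is started.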

Could you give us more details of your setup? In particular, it’ll be helpful
to know the following details:

 - DMTCP version
 - MPI library
 - Resource manager
 - Linux kernel version
 - Process limits (Try: ulimit -a)

If it helps, we’d be happy to assist you in setting up your environment.

[1]: http://www.ccs.neu.edu/home/gene/papers/hpdc14.pdf
[2]: https://www.tacc.utexas.edu/stampede/

Thanks,
Rohan

> On Aug 17, 2015, at 4:48 AM, Gad, Ramy <g...@uni-mainz.de> wrote:
>
> Hi,
>
> We have used DMTCP to checkpoint several MPI applications, for example
> mpiBLAST, Ray, PhyloBayes, and NAMD.
> However, we were not able to scale beyond 768 cores.
>
> My questions are:
>
> Is there a limit on the maximum scaling potential of DMTCP?
>
> Has anyone done any scaling tests? If so, are the results publicly available?
>
> Can we scale to more than 1K cores with DMTCP?
>
> Best regards,
>
> Ramy Gad
> Johannes Gutenberg - Universität Mainz
> Zentrum für Datenverarbeitung (ZDV)
>
> Anselm-Franz-von-Bentzel-Weg 12
> 55128 Mainz
> Germany
> E-Mail: g...@uni-mainz.de
> Office Phone: +49-6131-39-26437
>

