Lenny,

Is there any particular reason that you're using the trunk? The reason I
ask is because the trunk is in an unusually high state of flux at the
moment with a major move underway. If you're trying to use OMPI for
production grade runs, I would strongly advise picking up one of the stable
releases in the 1.8.x series. At this time,1.8.1 is available as the most
current stable release. The 1.8.2rc3 prerelease candidate is also available:

http://www.open-mpi.org/software/ompi/v1.8/

Best,

Josh






On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Lenny,
>
> that looks related to #4857 which has been fixed in trunk since r32517
>
> could you please update your openmpi library and try again ?
>
> Gilles
>
>
> On 2014/08/13 17:00, Lenny Verkhovsky wrote:
>
> Following Jeff's suggestion adding devel mailing list.
>
> Hi All,
> I am currently facing strange situation that I can't run OMPI on more than 65 
> nodes.
> It seems like environmental issue that does not allow me to open more 
> connections.
> Any ideas ?
> Log attached, more info below in the mail.
>
> Running OMPI from trunk
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
> Thanks.
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologies
> www.mellanox.com<http://www.mellanox.com> <http://www.mellanox.com>
>
>
> Office:    +972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:        +972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org <users-boun...@open-mpi.org>] 
> On Behalf Of Lenny Verkhovsky
> Sent: Tuesday, August 12, 2014 1:13 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
>
> Hi,
>
> Config:
> ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
> --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
> --disable-openib-connectx-xrc
>
> Run:
> /home/sources/ompi-bin/bin/mpirun -np 65 --host 
> ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
>  --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 
> --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
> hostname 2>&1|tee > /tmp/mpi.log
>
> Environment:
>      According to the attached log it's rsh environment
>
>
> Output attached
>
> Notes:
> The problem is always with tha last node, 64 connections work, 65 connections 
> fail.
> node-119.ssauniversal.ssa.kodiak.nx == ko0237
>
> mpi.log line 1034:
> --------------------------------------------------------------------------
> An invalid value was supplied for an enum variable.
>   Variable     : orte_debug_daemons
>   Value        : 1,1
>   Valid values : 0: f|false|disabled, 1: t|true|enabled
> --------------------------------------------------------------------------
>
> mpi.log line 1059:
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
>
>
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologies
> www.mellanox.com<http://www.mellanox.com> <http://www.mellanox.com>
>
>
> Office:    +972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:        +972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org <users-boun...@open-mpi.org>
> ] On Behalf Of Ralph Castain
> Sent: Monday, August 11, 2014 4:53 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
> Okay, let's start with the basics :-)
>
> How was this configured? What environment are you running in (rsh, slurm, 
> ??)? If you configured --enable-debug, then please run it with
>
> --mca plm_base_verbose 5 --debug-daemons
>
> and send the output
>
>
> On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky 
> <len...@mellanox.com<mailto:len...@mellanox.com> <len...@mellanox.com>> wrote:
>
> I don't think so,
> It's always the 66th node, even if I swap between 65th and 66th
> I also get the same error when setting np=66, while having only 65 hosts in 
> hostfile
> (I am using only tcp btl )
>
>
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologieswww.mellanox.com<http://www.mellanox.com/> 
> <http://www.mellanox.com/>
>
>
> Office:    +972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:        +972 72 257 9400
>
> From: users [mailto:users-boun...@open-mpi.org <users-boun...@open-mpi.org>
> ] On Behalf Of Ralph Castain
> Sent: Monday, August 11, 2014 1:07 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
> Looks to me like your 65th host is missing the dstore library - is it 
> possible you don't have your paths set correctly on all hosts in your 
> hostfile?
>
>
> On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky 
> <len...@mellanox.com<mailto:len...@mellanox.com> <len...@mellanox.com>> wrote:
>
>
> Hi all,
>
> Trying to run OpenMPI ( trunk Revision: 32428 ) I faced the problem running 
> OMPI with more than 65 procs.
> It looks like MPI failes to open 66th connection even with running `hostname` 
> over tcp.
> It also seems to unrelated to specific host.
> All hosts are Ubuntu 12.04.1 LTS
>
> mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
> --mca btl tcp,self hostname
> [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
> base/ess_base_std_orted.c at line 288
>
> .......................................
> It looks like environment issue, but I can't find any limit related.
> Any ideas ?
> Thanks.
> Lenny Verkhovsky
> SW Engineer,  Mellanox Technologieswww.mellanox.com<http://www.mellanox.com/> 
> <http://www.mellanox.com/>
>
>
> Office:    +972 74 712 9244
> Mobile:  +972 54 554 0233
> Fax:        +972 72 257 9400
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org<mailto:us...@open-mpi.org> <us...@open-mpi.org>
>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/24961.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org<mailto:us...@open-mpi.org> <us...@open-mpi.org>
>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/24964.php
>
>
>
> _______________________________________________
> devel mailing listde...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15626.php
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15627.php
>

Reply via email to