Ah, I see. That change didn't make it into the release branch. (I don't know
whether it was never CMRed; I have a vague recollection of it passing
through.) If you need that change, then I recommend checking out the trunk
at r30875. This was back when the trunk was in a more stable state.
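
A minimal sketch of such a checkout, assuming the trunk is served from the
Subversion URL below. The URL and the target directory name are assumptions
for illustration; only the revision number comes from this thread.

```shell
# Assumed repository URL; substitute whatever your site actually uses.
OMPI_SVN_URL="https://svn.open-mpi.org/svn/ompi/trunk"
# Revision named in this thread.
OMPI_REV=30875
# Hypothetical target directory for the working copy.
OMPI_SRC_DIR="ompi-trunk-r${OMPI_REV}"

# Build the checkout command; -r pins the working copy to that revision.
checkout_cmd="svn checkout -r ${OMPI_REV} ${OMPI_SVN_URL} ${OMPI_SRC_DIR}"

# Print the command for review instead of running it directly.
echo "$checkout_cmd"
```

Once the printed command looks right, it can be run with `eval "$checkout_cmd"`.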


Best,

Josh


On Wed, Aug 13, 2014 at 9:29 AM, Lenny Verkhovsky <len...@mellanox.com>
wrote:

>  Hi,
>
> I needed the following commit
>
>
>
> r30875 | vasily | 2014-02-27 13:29:47 +0200 (Thu, 27 Feb 2014) | 3 lines
>
> OPENIB BTL/CONNECT: Add support for AF_IB addressing in rdmacm.
>
>
>
> Following Gilles’s mail about the known #4857 issue, I updated and now I
> can run with more than 65 hosts.
>
> ( thanks,  Gilles )
>
>
>
> Since I am facing another problem, I probably should try 1.8rc as you
> suggested.
>
> Thanks.
>
> *Lenny Verkhovsky*
>
> SW Engineer,  Mellanox Technologies
>
> www.mellanox.com
>
>
>
> Office:    +972 74 712 9244
>
> Mobile:  +972 54 554 0233
>
> Fax:        +972 72 257 9400
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Joshua
> Ladd
> *Sent:* Wednesday, August 13, 2014 4:20 PM
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] [OMPI users] OpenMPI fails with np > 65
>
>
>
> Lenny,
>
> Is there any particular reason that you're using the trunk? The reason I
> ask is because the trunk is in an unusually high state of flux at the
> moment with a major move underway. If you're trying to use OMPI for
> production grade runs, I would strongly advise picking up one of the stable
> releases in the 1.8.x series. At this time, 1.8.1 is available as the most
> current stable release. The 1.8.2rc3 prerelease candidate is also available:
>
> http://www.open-mpi.org/software/ompi/v1.8/
>
> Best,
>
> Josh
>
>
>
>
>
>
> On Wed, Aug 13, 2014 at 5:19 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> Lenny,
>
> that looks related to #4857 which has been fixed in trunk since r32517
>
> could you please update your openmpi library and try again ?
>
> Gilles
>
>
>
> On 2014/08/13 17:00, Lenny Verkhovsky wrote:
>
>  Following Jeff's suggestion, adding the devel mailing list.
>
>
>
> Hi All,
>
> I am currently facing a strange situation: I can't run OMPI on more than 65
> nodes.
>
> It seems like an environmental issue that does not allow me to open more
> connections.
>
> Any ideas ?
>
> Log attached, more info below in the mail.
>
>
>
> Running OMPI from trunk
>
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
>
>
> Thanks.
>
> Lenny Verkhovsky
>
>
>
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lenny Verkhovsky
>
> Sent: Tuesday, August 12, 2014 1:13 PM
>
> To: Open MPI Users
>
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
>
>
>
>
> Hi,
>
>
>
> Config:
>
> ./configure --enable-openib-rdmacm-ibaddr --prefix /home/sources/ompi-bin 
> --enable-mpirun-prefix-by-default --with-openib=/usr/local --enable-debug 
> --disable-openib-connectx-xrc
>
>
>
> Run:
>
> /home/sources/ompi-bin/bin/mpirun -np 65 --host 
> ko0067,ko0069,ko0070,ko0074,ko0076,ko0079,ko0080,ko0082,ko0085,ko0087,ko0088,ko0090,ko0096,ko0098,ko0099,ko0101,ko0103,ko0107,ko0111,ko0114,ko0116,ko0125,ko0128,ko0134,ko0141,ko0144,ko0145,ko0148,ko0149,ko0150,ko0152,ko0154,ko0156,ko0157,ko0158,ko0162,ko0164,ko0166,ko0168,ko0170,ko0174,ko0178,ko0181,ko0185,ko0190,ko0192,ko0195,ko0197,ko0200,ko0203,ko0205,ko0207,ko0209,ko0210,ko0211,ko0213,ko0214,ko0217,ko0218,ko0223,ko0228,ko0229,ko0231,ko0235,ko0237
>  --mca btl openib,self  --mca btl_openib_cpc_include rdmacm --mca pml ob1 
> --mca btl_openib_if_include mthca0:1 --mca plm_base_verbose 5 --debug-daemons 
> hostname 2>&1|tee > /tmp/mpi.log
>
>
>
> Environment:
>
>      According to the attached log, it's an rsh environment
>
>
>
>
>
> Output attached
>
>
>
> Notes:
>
> The problem is always with the last node: 64 connections work, 65 connections
> fail.
>
> node-119.ssauniversal.ssa.kodiak.nx == ko0237
>
>
>
> mpi.log line 1034:
>
> --------------------------------------------------------------------------
>
> An invalid value was supplied for an enum variable.
>
>   Variable     : orte_debug_daemons
>
>   Value        : 1,1
>
>   Valid values : 0: f|false|disabled, 1: t|true|enabled
>
> --------------------------------------------------------------------------
>
>
>
> mpi.log line 1059:
>
> [node-119.ssauniversal.ssa.kodiak.nx:02996] [[56978,0],65] ORTE_ERROR_LOG: 
> Error in file base/ess_base_std_orted.c at line 288
>
>
>
>
>
>
>
> Lenny Verkhovsky
>
>
>
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>
> Sent: Monday, August 11, 2014 4:53 PM
>
> To: Open MPI Users
>
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
>
>
> Okay, let's start with the basics :-)
>
>
>
> How was this configured? What environment are you running in (rsh, slurm, 
> ??)? If you configured --enable-debug, then please run it with
>
>
>
> --mca plm_base_verbose 5 --debug-daemons
>
>
>
> and send the output
>
>
>
>
>
> On Aug 11, 2014, at 12:07 AM, Lenny Verkhovsky <len...@mellanox.com> wrote:
>
>
>
> I don't think so,
>
> It's always the 66th node, even if I swap the 65th and 66th.
>
> I also get the same error when setting np=66 while having only 65 hosts in
> the hostfile
>
> (I am using only tcp btl )
>
>
>
>
>
> Lenny Verkhovsky
>
>
>
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>
> Sent: Monday, August 11, 2014 1:07 AM
>
> To: Open MPI Users
>
> Subject: Re: [OMPI users] OpenMPI fails with np > 65
>
>
>
> Looks to me like your 65th host is missing the dstore library - is it 
> possible you don't have your paths set correctly on all hosts in your 
> hostfile?
>
>
>
>
>
> On Aug 10, 2014, at 1:13 PM, Lenny Verkhovsky <len...@mellanox.com> wrote:
>
>
>
>
>
> Hi all,
>
>
>
> Trying to run OpenMPI (trunk revision 32428), I hit a problem running
> OMPI with more than 65 procs.
>
> It looks like MPI fails to open the 66th connection, even when running `hostname`
> over tcp.
>
> It also seems to be unrelated to any specific host.
>
> All hosts are Ubuntu 12.04.1 LTS
>
>
>
> mpirun -np 66 --hostfile /proj/SSA/Mellanox/tmp//20140810_070156_hostfile.txt 
> --mca btl tcp,self hostname
>
> [nodename] [[4452,0],65] ORTE_ERROR_LOG: Error in file 
> base/ess_base_std_orted.c at line 288
>
>
>
> .......................................
>
> It looks like an environment issue, but I can't find any related limit.
>
> Any ideas ?
>
> Thanks.
>
> Lenny Verkhovsky
>
>
>
>
> _______________________________________________
>
> users mailing list
>
>  us...@open-mpi.org
>
>
>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/24961.php
>
>
>
> _______________________________________________
>
> users mailing list
>
>  us...@open-mpi.org
>
>
>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/08/24964.php
>
>
>
>
>
>
>
> _______________________________________________
>
> devel mailing list
>
> de...@open-mpi.org
>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15626.php
>
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15627.php
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15630.php
>
