Re: [OMPI devel] Need to know your Github ID

2014-09-12 Thread Brad Benton
bbenton -> bbenton

On Wed, Sep 10, 2014 at 5:46 AM, Jeff Squyres (jsquyres)  wrote:

> As the next step of the planned migration to Github, I need to know:
>
> - Your Github ID (so that you can be added to the new OMPI git repo)
> - Your SVN ID (so that I can map SVN->Github IDs, and therefore map Trac
> tickets to appropriate owners)
>
> Here's the list of SVN IDs who have committed over the past year -- I'm
> guessing that most of these people will need Github IDs:
>
>  adrian
>  alekseys
>  alex
>  alinas
>  amikheev
>  bbenton
>  bosilca (done)
>  bouteill
>  brbarret
>  bwesarg
>  devendar
>  dgoodell (done)
>  edgar
>  eugene
>  ggouaillardet
>  hadi
>  hjelmn
>  hpcchris
>  hppritcha
>  igoru
>  jjhursey (done)
>  jladd
>  jroman
>  jsquyres (done)
>  jurenz
>  kliteyn
>  manjugv
>  miked (done)
>  mjbhaskar
>  mpiteam (done)
>  naughtont
>  osvegis
>  pasha
>  regrant
>  rfaucett
>  rhc (done)
>  rolfv (done)
>  samuel
>  shiqing
>  swise
>  tkordenbrock
>  vasily
>  vvenkates
>  vvenkatesan
>  yaeld
>  yosefe
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15788.php
>


Re: [OMPI devel] openib max_cqe

2012-07-09 Thread Brad Benton
I am running into similar issues with both Mellanox and IBM HCAs.

On a node installed with RHEL6.2 and MLNX_OFED-1.5.3-3.0.0, there is a
significant hit to locked memory when going with the device's max_cqe.
Here, for comparison's sake, is the memory utilization for a simple MPI
process when using the new cq_size default, and when restricting it to 1500:

cq_size = max_cqe:
VmPeak:   348736 kB
VmSize:   348352 kB
VmLck:292096 kB
VmHWM:304896 kB
VmRSS:304896 kB
VmData:   333504 kB

cq_size = 1500:
VmPeak:86720 kB
VmSize:86336 kB
VmLck: 30080 kB
VmHWM: 42880 kB
VmRSS: 42880 kB
VmData:71488 kB
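
For reference, these per-process figures are the Vm* fields from the kernel's
status file; a minimal way to collect them (the PID of the MPI rank is
illustrative):

    grep -E 'VmPeak|VmSize|VmLck|VmHWM|VmRSS|VmData' /proc/12345/status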

For our Power systems using the IBM eHCA, the default value exhausts memory
and we can't even run.
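
For anyone who wants to see what a given adapter actually advertises, here is
a minimal libibverbs sketch (illustrative only -- this is not the openib BTL's
code, just a way to print the max_cqe value the new default is derived from):

    /* Sketch: print max_cqe for each verbs device on the node. */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int i, n;
        struct ibv_device **devs = ibv_get_device_list(&n);
        if (!devs) return 1;
        for (i = 0; i < n; i++) {
            struct ibv_context *ctx = ibv_open_device(devs[i]);
            struct ibv_device_attr attr;
            if (ctx && 0 == ibv_query_device(ctx, &attr))
                printf("%s: max_cqe = %d\n",
                       ibv_get_device_name(devs[i]), attr.max_cqe);
            if (ctx) ibv_close_device(ctx);
        }
        ibv_free_device_list(devs);
        return 0;
    }

(build with something like: gcc -o max_cqe max_cqe.c -libverbs)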

--Brad


On Fri, Jul 6, 2012 at 5:21 AM, TERRY DONTJE wrote:

>
>
> On 7/5/2012 5:47 PM, Shamis, Pavel wrote:
>
>  I mentioned on the call that for Mellanox devices (+OFA verbs) this resource
> is really cheap. Do you run a Mellanox HCA + OFA verbs?
>
>  (I'll reply because I know Terry is offline for the rest of the day)
>
> Yes, he does.
>
>  I asked because Sun used to have its own verbs driver.
>
>  I noticed this on a Solaris machine. I am not sure I have the same setup
> for Linux, but I'll look and see if I can reproduce the same issue on Linux.
>
> --td
>
>   The heart of the question: is it incorrect to assume that we'll consume 
> (num CQE * CQE size) registered memory for each QP opened?
>
>  QP or CQ ?  I think you don't want to assume anything there. Verbs (user and 
> kernel) do their own magic there.
> I think Mellanox should address this question.
>
> Regards,
> Pasha
> ___
> devel mailing list
> devel@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
>   Terry D. Dontje | Principal Software Engineer
> Developer Tools Engineering | +1.781.442.2631
>  Oracle - Performance Technologies
>  95 Network Drive, Burlington, MA 01803
> Email terry.don...@oracle.com
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [hwloc-devel] PPC64 problem with hwloc 1.1.1

2011-02-22 Thread Brad Benton
On Tue, Feb 22, 2011 at 9:29 PM, Samuel Thibault
<samuel.thiba...@inria.fr> wrote:

> Hello,
>
> Brad Benton, on Wed 23 Feb 2011 03:48:17 +0100, wrote:
> > Attached are two sets of info...one for the case when SMT
> > (Simultaneous MultiThreading) is off, and the other for when it is on.
>
> OK, found the issue: the device tree also reports the thread
> information for the disabled threads. I have committed the
> attached fix.
>

Great!


>
> Is it OK for you that we integrate your hwloc-gather-topo-smtoff.tar.bz2
> tarball in our series of regression tests?
>

Absolutely.

Thanks,
--Brad


>
> Samuel
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>
>


Re: [hwloc-devel] PPC64 problem with hwloc 1.1.1

2011-02-22 Thread Brad Benton
On Tue, Feb 22, 2011 at 6:24 PM, Jeff Squyres  wrote:

> Done!
>
> But we still need that info from Brad.  :-)
>
> (just in case he's only lightly reading this thread...)
>

Attached are two sets of info...one for the case when SMT
(Simultaneous MultiThreading) is off, and the other for when it is on.

This is on a 32-core Power6 system running RHEL5.4 with a 2.6.18 kernel.
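
In case it helps with reproducing, this is roughly how the attached files were
produced (assuming the hwloc-gather-topology script that ships with hwloc; the
exact invocation may differ slightly for 1.1.x):

    # SMT disabled; repeat with SMT enabled for the second set
    hwloc-gather-topology hwloc-gather-topo-smtoff
    # yields hwloc-gather-topo-smtoff.tar.bz2 and hwloc-gather-topo-smtoff.output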

Thanks,
--Brad



>
> On Feb 22, 2011, at 7:05 PM, Samuel Thibault wrote:
>
> > Jeff Squyres, on Wed 23 Feb 2011 00:20:42 +0100, wrote:
> >> On Feb 22, 2011, at 6:02 PM, Samuel Thibault wrote:
> >>
> >>> Note the "/* TODO: how to report? */" comment in the code: we
> definitely
> >>> _want_ to get users to see the warning and report it.
> >>
> >> Ah, ok.  Could we make that message a little more clear, then?  Maybe
> something like this:
> >
> > That'd be much more clear yes :)
> >
> > Samuel
> > ___
> > hwloc-devel mailing list
> > hwloc-de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>


hwloc-gather-topo-smtoff.output
Description: Binary data


hwloc-gather-topo-smtoff.tar.bz2
Description: BZip2 compressed data


hwloc-gather-topo-smton.output
Description: Binary data


hwloc-gather-topo-smton.tar.bz2
Description: BZip2 compressed data


Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-02-03 Thread Brad Benton
What IP interfaces are configured on the cluster?  In particular, are there
IPoIB interfaces that are configured?  If you use the dynamic connection
method but restrict either the number or type of IP interfaces to be used
via oob_tcp_if_{include,exclude}, do you still see the problem?
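
For example (interface name and process count are illustrative; adjust to the
cluster's actual configuration):

    # keep the OOB off the IPoIB interface
    mpirun --mca oob_tcp_if_include eth0 -np 64 ./a.out

    # or pre-establish all connections up front, as in Doron's quoted message below
    mpirun --mca mpi_preconnect_mpi 1 -np 64 ./a.out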

--brad


On Wed, Jan 26, 2011 at 12:14 AM, Doron Shoham  wrote:

> Using the flag --mca mpi_preconnect_mpi seems to solve the issue with the
> oob connection manager.
> This solution is not scalable, but it looks more and more like a connection
> establishment problem.
> I'm still trying to figure out what the root cause of this is and how to
> solve it.
> Any ideas will be more than welcome.
>
>
> Thanks,
> Doron
>
> On Tue, Jan 18, 2011 at 3:29 PM, Terry Dontje wrote:
>
>>  On 01/18/2011 07:48 AM, Jeff Squyres wrote:
>>
>> IBCM is broken and disabled (has been for a long time).
>>
>> Did you mean RDMACM?
>>
>>
>>
>> No I think I meant OMPI oob.
>>
>> sorry,
>>
>> --
>>
>> Terry D. Dontje | Principal Software Engineer
>> Developer Tools Engineering | +1.781.442.2631
>> Oracle - Performance Technologies
>> 95 Network Drive, Burlington, MA 01803
>> Email terry.don...@oracle.com
>>
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] [Fwd: strong ordering for data registered memory]

2009-11-11 Thread Brad Benton
On Wed, Nov 11, 2009 at 10:29 AM, Terry Dontje  wrote:

> Jeff Squyres wrote:
>
>> On Nov 11, 2009, at 8:13 AM, Terry Dontje wrote:
>>
>>  Sun's IB group has asked me to forward the following email to see if
>>> anyone has any comments on this email.
>>>
>>>
>> Tastes great / less filling.  :-)
>>
>> I think (assume) we'll be happy to implement changes like this that come
>> from the upstream OpenFabrics verbs API (I see that this mail was first
>> directed to the linux-rdma list, which is where the OF verbs API is
>> discussed).
>>
>>  That's pretty much what I presumed.  I am curious though if others think
> their platforms could/would use such hooks?
>

We at IBM would be interested in such a hook and could use it with some of
our architectures.

--brad


Re: [OMPI devel] 1.3.3 Release Schedule

2009-06-17 Thread Brad Benton
On Wed, Jun 17, 2009 at 6:45 AM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Looks good to me.  Brad -- can you add this to the wiki in the 1.3 series
> page?


done: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.3

--brad



>
>
> On Jun 16, 2009, at 10:37 PM, Brad Benton wrote:
>
>  All:
>>
>> We are close to releasing 1.3.3.  This is the current plan:
>>  - Evening of 6/16: collect MTT runs on the current branch w/the current
>> 1.3.3 features & fixes
>>  - If all goes well with the overnight MTT runs, roll a release candidate
>> on 6/17
>>  - Put 1.3.3rc1 through its paces over the next couple of days
>>  - If all goes well with rc1, release 1.3.3 on Friday, June 19
>>
>> 1.3.3 will include support for Windows as its major new feature, as well
>> as a number of defect fixes.
>>
>> 1.3.3 will be the final feature release in the 1.3 series.  As such, with
>> the new feature/stable numbering
>> scheme, the next release in the series will contain defect fixes only and
>> will transition to 1.4.  This
>> will be the stable/maintenance branch.  The plan is for it to follow the
>> 1.3.3 release by a fairly short time
>> (4-6 weeks), and subsequent releases in the series will take place as need
>> be depending on the bug
>> fix volume & criticality.
>>
>> Thanks,
>> --brad
>> 1.3/1.4 co-release mgr
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


[OMPI devel] 1.3.3 Release Schedule

2009-06-16 Thread Brad Benton
All:
We are close to releasing 1.3.3.  This is the current plan:
  - Evening of 6/16: collect MTT runs on the current branch w/the current
1.3.3 features & fixes
  - If all goes well with the overnight MTT runs, roll a release candidate
on 6/17
  - Put 1.3.3rc1 through its paces over the next couple of days
  - If all goes well with rc1, release 1.3.3 on Friday, June 19

1.3.3 will include support for Windows as its major new feature, as well as
a number of defect fixes.

1.3.3 will be the final feature release in the 1.3 series.  As such, with
the new feature/stable numbering
scheme, the next release in the series will contain defect fixes only and
will transition to 1.4.  This
will be the stable/maintenance branch.  The plan is for it to follow the
1.3.3 release by a fairly short time
(4-6 weeks), and subsequent releases in the series will take place as need
be depending on the bug
fix volume & criticality.

Thanks,
--brad
1.3/1.4 co-release mgr


[OMPI devel] Fwd: [Open MPI Announce] Critical bug notice

2009-03-27 Thread Brad Benton
In reference to this critical bug, there are implications for the current
1.3.x release schedule that are alluded to in Jeff's message.  In
particular, there are two time-critical issues at play:
  1) getting a fix for #1853 in time for inclusion in OFED-1.4.1
  2) getting in Sun's changes/CMRs in time for their next test/release cycle

Given those two time-constrained goals, we have decided to proceed as
follows:
  - Sun's desired changes are either already in the 1.3 branch, or the CMRs
have already been approved for inclusion
  - hold off non-Sun related CMRs until a fix for #1853 is available,
hopefully sometime next week
  - release this combination as 1.3.2
  - the Windows functionality will then follow as a separate release: 1.3.3

I know that this, once again, pushes out the Windows functionality, but I
think that this is necessary in order to get this critical fix in.

Thanks,
--Brad



-- Forwarded message --
From: Jeff Squyres 
Date: Fri, Mar 27, 2009 at 1:34 PM
Subject: [Open MPI Announce] Critical bug notice
To: Open MPI Announcements, Open MPI Developers <de...@open-mpi.org>, Open MPI Users


The Open MPI team has uncovered a serious bug in Open MPI v1.3.0 and v1.3.1:
when running on OpenFabrics-based networks, silent data corruption is
possible in some cases.  There are two workarounds to avoid the issue --
please see the bug ticket that has been opened about this issue for further
details:

   https://svn.open-mpi.org/trac/ompi/ticket/1853

We strongly encourage all users who are using Open MPI v1.3.0 and/or v1.3.1
on OpenFabrics-based networks to read this ticket and use one of the
workarounds described there.

The Open MPI team is working on a fix; it will be included in the v1.3.2
release.  Updates will be posted to the ticket.

-- 
Jeff Squyres
Cisco Systems

___
announce mailing list
annou...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/announce


Re: [OMPI devel] 1.3.1rc5

2009-03-19 Thread Brad Benton
Things look good from the IBM side as well.  So, RM-approved for release.
--brad
ompi 1.3 co-release manager



On Thu, Mar 19, 2009 at 7:31 AM, Jeff Squyres  wrote:

> Looks good to cisco.  Ship it.
>
> I'm still seeing a very low incidence of the sm segv during startup (.01%
> -- 23 tests out of ~160k), so let's ship 1.3.1 and roll in Eugene's new sm
> code for 1.3.2.
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] 1.3.1?

2009-03-12 Thread Brad Benton
Ahh...replied to the MTT segv thread...but will reiterate here:  George & I
talked and we are in agreement that we should go ahead and release 1.3.1 as
it currently stands.
Now on to 1.3.2!

--brad


On Thu, Mar 12, 2009 at 7:52 AM, Jeff Squyres  wrote:

> So -- RM's -- can we release 1.3.1?  The tarball is ready (it's made at the
> same time as RC tarballs to guarantee that it's the same).  All that's
> necessary is posting it to the web site and sending out the announcement.
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Brad Benton
r20275 looks good.  I suggest that we CMR that into 1.3 and get rc6 rolled
and tested. (actually, Jeff just did the CMR...so off to rc6)
--brad


On Wed, Jan 14, 2009 at 1:16 PM, Edgar Gabriel  wrote:

> so I am not entirely sure why the bug only happened on trunk; it could in
> theory also appear on v1.3 (is there a difference in how pointer_arrays are
> handled between the two versions?)
>
> Anyway, it passes now on both with changeset 20275. We should probably move
> that over to 1.3 as well; whether for 1.3.0 or 1.3.1, I leave that up to
> others to decide...
>
> Thanks
> Edgar
>
>
> Edgar Gabriel wrote:
>
>> I'm already debugging it. The good news is that it only seems to appear
>> with trunk; with 1.3 (after copying the new tuned module over), all the
>> tests pass.
>>
>> Now if somebody can tell me a trick for telling mpirun not to kill the
>> debugger out from under me, then I could even see where the problem occurs :-)
>>
>> Thanks
>> Edgar
>>
>> George Bosilca wrote:
>>
>>> All these errors are in MPI_Finalize; it should not be that hard to
>>> find. I'll take a look later this afternoon.
>>>
>>>  george.
>>>
>>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>>
>>>  Unfortunately, although this fixed some problems when enabling hierarch
 coll,
 there is still a segfault in two of IU's tests that only shows up when
 we set
 -mca coll_hierarch_priority 100

 See this MTT summary to see how the failures improved on the trunk,
 but that there are still two that segfault even at 1.4a1r20267:
 http://www.open-mpi.org/mtt/index.php?do_redir=923

 This link just has the remaining failures:
 http://www.open-mpi.org/mtt/index.php?do_redir=922

 So, I'll vote for applying the CMR for 1.3 since it clearly improved
 things,
 but there is still more to be done to get coll_hierarch ready for
 regular
 use.

 On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
 wrote:

> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
> george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>  Let's debate tomorrow when people are around, but first you have to
>> file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>  Unfortunately, this pinpoints the fact that we didn't test the
>>> collective module mixing thing enough. I went over the tuned collective
>>> functions
>>> and changed all instances to use the correct module information. It
>>> is now
>>> on the trunk, revision 20267. Simultaneously, I checked that all other
>>> collective components do the right thing ... and I have to admit
>>> tuned was
>>> the only faulty one.
>>>
>>> This is clearly a bug in the tuned, and correcting it will allow
>>> people
>>> to use the hierarch. In the current incarnation 1.3 will
>>> mostly/always
>>> segfault when hierarch is active. I would prefer not to give a broken
>>> toy
>>> out there. How about pushing r20267 in the 1.3?
>>>
>>> george.
>>>
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
>>>  Thanks for digging into this.  Can you file a bug?  Let's mark it
 for
 v1.3.1.

 I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
 and
 since hierarch isn't currently selected by default (you must
 specifically
 elevate hierarch's priority to get it to run), there's no danger
 that users
 will run into this problem in default runs.

 But clearly the problem needs to be fixed, and therefore we need a
 bug
 to track it.



 On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

  I just debugged the Reduce_scatter bug mentioned previously. The
> bug is
> unfortunately not in hierarch, but in tuned.
>
> Here is the code snippet causing the problems:
>
> int reduce_scatter (..., mca_coll_base_module_t *module)
> {
> ...
> err = comm->c_coll.coll_reduce (..., module)
> ...
> }
>
>
> but should be
> {
> ...
> err = comm->c_coll.coll_reduce (...,
> comm->c_coll.coll_reduce_module);
> ...
> }
>
> The problem as it is right now is that when using hierarch, only a
> subset of the functions are set, e.g. reduce, allreduce, bcast and
> barrier.
> Thus, reduce_scatter is from tuned in most scenarios, and calls the
> subsequent functions with the wrong module. Hierarch of course does
> not like
> that :-)
>
> Anyway, a quick glance through the tuned code reveals a significant
> number of 

Re: [OMPI devel] reduce_scatter bug with hierarch

2009-01-14 Thread Brad Benton
So, if it looks okay on 1.3...then there should not be anything holding up
the release, right?  Otherwise, George, we need to decide whether or not
this is a blocker, or if we go ahead and release with this as a known issue
and schedule the fix for 1.3.1.  My vote is to go ahead and release, but if
you (or others) think otherwise, let's talk about how best to move forward.
--brad


On Wed, Jan 14, 2009 at 12:04 PM, Edgar Gabriel  wrote:

> I'm already debugging it. The good news is that it only seems to appear
> with trunk; with 1.3 (after copying the new tuned module over), all the
> tests pass.
>
> Now if somebody can tell me a trick for telling mpirun not to kill the
> debugger out from under me, then I could even see where the problem occurs :-)
>
> Thanks
> Edgar
>
>
> George Bosilca wrote:
>
>> All these errors are in MPI_Finalize; it should not be that hard to
>> find. I'll take a look later this afternoon.
>>
>>  george.
>>
>> On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
>>
>>  Unfortunately, although this fixed some problems when enabling hierarch
>>> coll,
>>> there is still a segfault in two of IU's tests that only shows up when we
>>> set
>>> -mca coll_hierarch_priority 100
>>>
>>> See this MTT summary to see how the failures improved on the trunk,
>>> but that there are still two that segfault even at 1.4a1r20267:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=923
>>>
>>> This link just has the remaining failures:
>>> http://www.open-mpi.org/mtt/index.php?do_redir=922
>>>
>>> So, I'll vote for applying the CMR for 1.3 since it clearly improved
>>> things,
>>> but there is still more to be done to get coll_hierarch ready for regular
>>> use.
>>>
>>> On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca 
>>> wrote:
>>>
 Here we go by the book :)

 https://svn.open-mpi.org/trac/ompi/ticket/1749

 george.

 On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

  Let's debate tomorrow when people are around, but first you have to
> file a
> CMR... :-)
>
> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>
>  Unfortunately, this pinpoints the fact that we didn't test the
>> collective module mixing thing enough. I went over the tuned collective
>> functions
>> and changed all instances to use the correct module information. It is
>> now
>> on the trunk, revision 20267. Simultaneously, I checked that all other
>> collective components do the right thing ... and I have to admit tuned
>> was
>> the only faulty one.
>>
>> This is clearly a bug in the tuned, and correcting it will allow
>> people
>> to use the hierarch. In the current incarnation 1.3 will mostly/always
>> segfault when hierarch is active. I would prefer not to give a broken
>> toy
>> out there. How about pushing r20267 in the 1.3?
>>
>> george.
>>
>>
>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>
>>  Thanks for digging into this.  Can you file a bug?  Let's mark it for
>>> v1.3.1.
>>>
>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch,
>>> and
>>> since hierarch isn't currently selected by default (you must
>>> specifically
>>> elevate hierarch's priority to get it to run), there's no danger that
>>> users
>>> will run into this problem in default runs.
>>>
>>> But clearly the problem needs to be fixed, and therefore we need a
>>> bug
>>> to track it.
>>>
>>>
>>>
>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>
>>>  I just debugged the Reduce_scatter bug mentioned previously. The bug
 is
 unfortunately not in hierarch, but in tuned.

 Here is the code snippet causing the problems:

 int reduce_scatter (..., mca_coll_base_module_t *module)
 {
 ...
 err = comm->c_coll.coll_reduce (..., module)
 ...
 }


 but should be
 {
 ...
 err = comm->c_coll.coll_reduce (...,
 comm->c_coll.coll_reduce_module);
 ...
 }

 The problem as it is right now is that when using hierarch, only a
 subset of the functions are set, e.g. reduce, allreduce, bcast and
 barrier.
 Thus, reduce_scatter is from tuned in most scenarios, and calls the
 subsequent functions with the wrong module. Hierarch of course does
 not like
 that :-)

 Anyway, a quick glance through the tuned code reveals a significant
 number of instances where this appears (reduce_scatter, allreduce,
 allgather,
 allgatherv). Basic, hierarch and inter seem to do that mostly
 correctly.

 Thanks
 Edgar
 --
 Edgar Gabriel
 Assistant Professor
 Parallel Software Technologies Lab  

[OMPI devel] Schedule for 1.3 Release Candidates and final Release

2008-12-02 Thread Brad Benton
Open MPI v1.3 is close to being ready for release.  The remaining defects
are being worked and our goal is to have those wrapped up and v1.3
officially released on December 9.  Here is the schedule leading up to the
final release:
  December 2: RC1
  December 5: RC2
  December 9: 1.3 Release

This has been a long effort involving many developers and organizations and
the release cycle has been frustratingly long, but after one more push, we
should have this out the door.  In order to help ensure the release goes
smoothly, please take some time to download and do some basic sanity checks
on the release candidates as they become available.

Thanks,
--brad

Brad Benton
Open MPI v1.3 co-release manager


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
On Mon, Jul 28, 2008 at 12:08 PM, Terry Dontje  wrote:

> Jeff Squyres wrote:
>
>> On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
>>
>>  Interesting. The self is only used for local communications. I don't
>>> expect that any benchmark executes such communications, but apparently I was
>>> wrong. Please let me know the failing test; I will take a look this evening.
>>>
>>
>> FWIW, my manual tests of a simplistic "ring" program work for all
>> combinations (openib, openib+self, openib+self+sm).  Shrug.
>>
>> But for OSU latency, I found that openib, openib+sm work, but
>> openib+sm+self hangs (same results whether the 2 procs are on the same node
>> or different nodes).  There is no self communication in osu_latency, so
>> something else must be going on.
>>
>>  Is it something to do with the MPI_Barrier call?  osu_latency uses
> MPI_Barrier and from rhc's email it sounds like his code does too.


I don't think it's an issue with MPI_Barrier().  I'm running into this
problem with srtest.c (one of the example programs from the mpich
distribution).  It's a ring-type test with no barriers until the end, yet it
hangs on the very first Send/Recv pair from rank0 to rank1.

In my case, openib and openib+sm work, but openib+self & openib+sm+self
hang.
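
In command-line terms (the srtest binary name is from the mpich examples,
process count illustrative):

    mpirun -np 2 --mca btl openib,sm ./srtest       # works
    mpirun -np 2 --mca btl openib,self ./srtest     # hangs on the first Send/Recv
    mpirun -np 2 --mca btl openib,sm,self ./srtest  # hangs as well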

--brad


>
> --td
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] trunk hangs since r19010

2008-07-28 Thread Brad Benton
My experience is the same as Lenny's.  I've tested on x86_64 and ppc64
systems and tests using --mca btl openib,self hang in all cases.

--brad


2008/7/28 Lenny Verkhovsky 

> I failed to run on different nodes or on the same node via self,openib
>
>
>
> On 7/28/08, Ralph Castain  wrote:
>>
>> I checked this out some more and I believe it is ticket #1378 related. We
>> lock up if SM is included in the BTL's, which is what I had done on my test.
>> If I ^sm, I can run fine.
>>
>> On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
>>
>> It could also be something new. Brad and I noted on Fri that IB was
>> locking up as soon as we tried any cross-node communications. Hadn't seen
>> that before, and at least I haven't explored it further - planned to do so
>> today.
>>
>> On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
>>
>> I believe it is.
>>
>> On 7/28/08, Jeff Squyres  wrote:
>>>
>>> On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
>>>
>>>  Is this related to r1378?

>>>
>>> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
>>>
>>>
>>>  On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:

  Hi,
>
> I experience hanging of tests ( latency ) since r19010
>
>
> Best Regards
>
> Lenny.
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


 --
 Jeff Squyres
 Cisco Systems


>>>
>>> --
>>> Jeff Squyres
>>> Cisco Systems
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] v1.3 RM: need a ruling

2008-07-10 Thread Brad Benton
I think this is very reasonable to go ahead and include for 1.3.  I find
that preferable to a 1.3-specific "wonky" workaround.  Plus, this sounds
like something that is very good to have in general.
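
For what it's worth, the usage the synonym preserves is just the historical
command line (process count illustrative):

    # old, widely documented spelling keeps working...
    mpirun --mca mpi_paffinity_alone 1 -np 16 ./a.out
    # ...and maps onto the new underlying parameter
    mpirun --mca opal_paffinity_alone 1 -np 16 ./a.out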

--brad


On Wed, Jul 9, 2008 at 8:49 PM, Jeff Squyres  wrote:

> v1.3 RMs: Due to some recent work, the MCA parameter mpi_paffinity_alone
> disappeared -- it was moved and renamed to be opal_paffinity_alone.  This is
> Bad because we have a lot of historical precedent based on the MCA param name
> "mpi_paffinity_alone" (FAQ, PPT presentations, e-mails on public lists,
> etc.).  So it needed to be restored for v1.3.  I just noticed that I hadn't
> opened a ticket on this -- sorry -- I opened #1383 tonight.
>
> For a variety of reasons described in the commit message r1383, Lenny and I
> first decided that it would be best to fix this problem by the functionality
> committed in r18770 (have the ability to find out where an MCA parameter was
> set).  This would allow us to register two MCA params: mpi_paffinity_alone
> and opal_paffinity_alone, and generally do the Right Thing (because we could
> then tell if a user had set a value or whether it was a default MCA param
> value).  This functionality will also be useful in the openib BTL, where
> there is a blend of MCA parameters and INI file parameters.
>
> However, after doing that, it seemed like only a few more steps to
> implement an overall better solution: implement "synonyms" for MCA
> parameters.  I.e., register the name "mpi_paffinity_alone" as a synonym for
> opal_paffinity_alone.  Along the way, it was trivial to add a "deprecated"
> flag for MCA parameters that we no longer want to use anymore (this
> deprecated flag is also useful in the OB1 PML and openib BTL).
>
> So to fix a problem that needed to be fixed for v1.3 (restore the MCA
> parameter "mpi_paffinity_alone"), I ended up implementing new functionality.
>
> Can this go into v1.3, or do we need to implement some kind of alternate
> fix?  (I admit to not having thought through what it would take to fix
> without the new MCA parameter functionality -- it might be kinda wonky)
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


[OMPI devel] Branching the trunk for 1.3, and branch update policy

2008-06-24 Thread Brad Benton
All:

The trunk is now ready to be branched for the 1.3 release.  Jeff has
volunteered to do the branch in his copious spare time (thanks Jeff!).  So,
I expect that to happen either this evening or sometime tomorrow.

There are numerous tickets to resolve to get 1.3 ready for release.  So, for
the time being (i.e., the next 2-4 weeks) we will not be formally managing
updates to the 1.3 branch.  However, please relate all checkins to one or
more tickets.  Also, please ensure that large changes get reviewed before
checking in.

Thanks,
--brad


Re: [OMPI devel] warn when fork() used with openib btl?

2008-06-24 Thread Brad Benton
I think this is a good idea, and for the reasons you outline in your
Rationale.  This definitely bites people from time to time at Big Blue as
well, and a gentle warning will certainly help.
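
For anyone who hasn't hit it, a tiny sketch of the kind of code the warning
would target (purely illustrative):

    /* MPI code that shells out after the openib BTL has registered memory --
     * the fork()/exec() hiding inside system() is what gets people. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* ... communication over openib, registered memory now in play ... */
        system("hostname");
        MPI_Finalize();
        return 0;
    }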

--brad


On Mon, Jun 23, 2008 at 8:42 AM, Jeff Squyres  wrote:

> Those who care about the openib BTL:
>
> What do you think about warning when fork() is used with the openib BTL?
>  See https://svn.open-mpi.org/trac/ompi/ticket/1244 for details.
>
> Rationale: Several Cisco customers have been bitten by not realizing that
> they had calls to system() in their MPI codes when switching away from older
> mVAPI-based stacks to OFED (the older Cisco/Topspin mVAPI stack was a bit
> more tolerant of fork()).  Newer kernels and OFED versions can handle
> fork() better, but I've still had spurious reports of MPI codes failing when
> system() was used (never had much chance to follow up to see what was
> actually happening, though -- it *should* have worked...?).
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] Memory hooks change testing

2008-06-11 Thread Brad Benton
np...i'll give it another try (and will correspondingly endeavor to mollify
the mercurial gods
as best i can).

thx,
--brad
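
For context on Brian's quoted note below: with the revised behavior, opting
into ptmalloc2 would just be a matter of adding the library at link time,
e.g. (file names illustrative):

    mpicc -o app app.c -lopenmpi-malloc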


On Wed, Jun 11, 2008 at 4:50 PM, Brian W. Barrett 
wrote:

> Brad unfortunately figured out I had done something to annoy the gods of
> mercurial and the repository below didn't contain all the changes
> advertised (and in fact didn't work).  I've since rebuilt the repository
> and verified it works now.  I'd recommend deleting your existing clones of
> the repository below and starting over.
>
> Sorry about that,
>
> Brian
>
>
> On Wed, 11 Jun 2008, Brian Barrett wrote:
>
> > Did anyone get a chance to test (or think about testing) this?  I'd like
> to
> > commit the changes on Friday evening, if I haven't heard any negative
> > feedback.
> >
> > Brian
> >
> > On Jun 9, 2008, at 8:56 PM, Brian Barrett wrote:
> >
> >> Hi all -
> >>
> >> Per the RFC I sent out last week, I've implemented a revised behavior of
> >> the memory hooks for high-speed networks.  It's a bit different from what the
> >> RFC proposed, but still very minor and fairly straightforward.
> >>
> >> The default is to build ptmalloc2 support, but as an almost complete
> >> standalone library.  If the user wants to use ptmalloc2, he only has to
> add
> >> -lopenmpi-malloc to his link line.  Even when standalone and
> openmpi-malloc
> >> is not linked in, we'll still intercept munmap as it's needed for
> mallopt
> >> (below) and we've never had any trouble with that part of ptmalloc2
> (it's
> >> easy to intercept).
> >>
> >> As a *CHANGE* in behavior, if leave_pinned support is turned on and
> there's
> >> no ptmalloc2 support, we will automatically enable mallopt.  As a
> *CHANGE*
> >> in behavior, if the user disables mallopt or mallopt is not available
> and
> >> leave pinned is requested, we'll abort.  I think these both make sense
> and
> >> are closest to expected behavior, but wanted to point them out.  It is
> >> possible for the user to disable mallopt and enable leave_pinned, but
> that
> >> will *only* work if there is some other mechanism for intercepting free
> >> (basically, it allows a way to ensure you're using ptmalloc2 instead of
> >> mallopt).
> >>
> >> There is also a new memory component, mallopt, which only intercepts
> munmap
> >> and exists only to allow users to enable mallopt while not building the
> >> ptmalloc2 component at all.  Previously, our mallopt support was lacking
> in
> >> that it didn't cover the case where users explicitly called munmap in
> their
> >> applications.  Now, it does.
> >>
> >> The changes are fairly small and can be seen/tested in the HG repository
> >> bwb/mem-hooks, URL below.  I think this would be a good thing to push to
> >> 1.3, as it will solve an ongoing problem on Linux (basically, users
> getting
> >> screwed by our ptmalloc2 implementation).
> >>
> >>   http://www.open-mpi.org/hg/hgwebdir.cgi/bwb/mem-hooks/
> >>
> >> Brian
> >>
> >> --
> >> Brian Barrett
> >> Open MPI developer
> >> http://www.open-mpi.org/
> >>
> >>
> >
> >
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] Trunk check-in policy until the branch for 1.3

2008-05-20 Thread Brad Benton
2008/5/20 Richard Graham <rlgra...@ornl.gov>:

>  Brad,
>   Do you want these for bug fixes too ?
>

I think that it's okay to check in small bug fixes without a ticket.  I know
this is a somewhat nebulous guideline, but I'm thinking of bug fixes of a few
lines as being "small".  So, unless George has an objection, I'm fine with
that.

--Brad



>
> Rich
>
>
>
> On 5/20/08 5:53 PM, "Brad Benton" <bradford.ben...@gmail.com> wrote:
>
> All:
>
> In order to better track changes on the trunk until we branch for 1.3, we
> (the release managers) would like to ask that all trunk checkins have
> corresponding tickets associated with them.  This will help us to keep
> better track of the state of the trunk prior to branching.  Note, this is
> just until we branch, which, hopefully, will be in a few days.
>
> The plan is to branch the trunk for 1.3 this Friday evening (May 23).
>  However, depending on the state of the trunk and the final items to get in
> before the branch, we might decide to delay the branch until the following
> Tuesday (May 27).  George, Jeff & I will discuss this on Friday afternoon
> and will send out the final plan for branching (Friday or Tuesday) at that
> time.
>
> Thanks,
> --Brad
>
> --
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


[OMPI devel] Trunk check-in policy until the branch for 1.3

2008-05-20 Thread Brad Benton
All:

In order to better track changes on the trunk until we branch for 1.3, we
(the release managers) would like to ask that all trunk checkins have
corresponding tickets associated with them.  This will help us to keep
better track of the state of the trunk prior to branching.  Note, this is
just until we branch, which, hopefully, will be in a few days.

The plan is to branch the trunk for 1.3 this Friday evening (May 23).
However, depending on the state of the trunk and the final items to get in
before the branch, we might decide to delay the branch until the following
Tuesday (May 27).  George, Jeff & I will discuss this on Friday afternoon
and will send out the final plan for branching (Friday or Tuesday) at that
time.

Thanks,
--Brad


Re: [OMPI devel] v1.3 Feature Freeze in effect

2008-05-14 Thread Brad Benton
Yes...bug fixes are definitely allowed.  We won't go to controlled commits
until after branching.

--brad


On Wed, May 14, 2008 at 8:39 AM, Terry Dontje <terry.don...@sun.com> wrote:

> Am I right to assume that bug fixes are allowed?
>
> --td
>
> Brad Benton wrote:
> > All:
> >
> > As of today (May 13, 2008), the trunk is under v1.3 feature freeze
> > until it is stabilized and branched (targeted for May 23, 2008).  Here
> > are the guidelines for activity in the trunk while we are under the
> > v1.3 feature freeze:
> >
> >1. New components can still be checked into the trunk, but do so
> >   with .ompi_ignore so that they can be filtered out at branch
> >   time. Also, for newly checked in components, enter a trac ticket
> >   as a reminder to clean up the .ompi_ignore, if necessary, after
> >   the branch
> >
> >2. The following items have /special/ dispensation to come into the
> >   trunk late (cutoff date for them is May 20)
> >   * Final parts of 1.3 Thread Multiple Support
> >   * Send & Receive changes for improved latency (#1250
> > <https://svn.open-mpi.org/trac/ompi/ticket/1250>)
> >   * XML component for orte_output() and friends
> >
> >
> > Thanks,
> > --Brad
> >
> >
> > Brad Benton
> > Open MPI v1.3 co-release manager
> >
> > 
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


[OMPI devel] v1.3 Feature Freeze in effect

2008-05-13 Thread Brad Benton
All:

As of today (May 13, 2008), the trunk is under v1.3 feature freeze until it
is stabilized and branched (targeted for May 23, 2008).  Here are the
guidelines for activity in the trunk while we are under the v1.3 feature
freeze:

   1. New components can still be checked into the trunk, but do so with
   .ompi_ignore so that they can be filtered out at branch time. Also,
   for newly checked in components, enter a trac ticket as a reminder to clean
   up the .ompi_ignore, if necessary, after the branch

   2. The following items have *special* dispensation to come into the
   trunk late (cutoff date for them is May 20)
  - Final parts of 1.3 Thread Multiple Support
   - Send & Receive changes for improved latency
     (#1250 <https://svn.open-mpi.org/trac/ompi/ticket/1250>)
  - XML component for orte_output() and friends


Thanks,
--Brad


Brad Benton
Open MPI v1.3 co-release manager


[OMPI devel] 1.3 Release schedule and contents

2008-02-11 Thread Brad Benton
All:

The latest scrub of the 1.3 release schedule and contents is ready for
review and comment.  Please use the following links:
  1.3 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3
  1.3.1 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.1

In order to try and keep the dates for 1.3 in, I've pushed a bunch of stuff
(particularly ORTE things) to 1.3.1.  Even though there will be new
functionality slated for 1.3.1, the goal is to not have any interface
changes between the phases.

Please look over the list and schedules and let me or my fellow
1.3 co-release manager George Bosilca (bosi...@eecs.utk.edu) know of any
issues, errors, suggestions, omissions, heartburn, etc.

Thanks,
--Brad

Brad Benton
IBM