Re: [OMPI devel] regression with derived datatypes

2014-05-29 Thread George Bosilca
r31904 should fix this issue. Please test it thoroughly and report any issues.

  George.


On Fri, May 9, 2014 at 6:56 AM, Gilles Gouaillardet
 wrote:
> i opened #4610 https://svn.open-mpi.org/trac/ompi/ticket/4610
> and attached a patch for the v1.8 branch
>
> i ran several tests from the intel_tests test suite and did not observe
> any regression.
>
> please note there are still issues when running with --mca btl
> scif,vader,self
>
> this might be another issue, i will investigate more next week
>
> Gilles
>
> On 2014/05/09 18:08, Gilles Gouaillardet wrote:
>> I ran some more investigations with --mca btl scif,self
>>
>> i found that the previous patch i posted was complete crap and i
>> apologize for it.
>>
>> on a brighter side, and imho, the issue only occurs if fragments are
>> received (and then processed) out of order.
>> /* i did not observe this with the tcp btl, but i always see that with
>> the scif btl, i guess this can be observed too
>> with openib+RDMA */
>>
>> in this case only, opal_convertor_generic_simple_position(...) is
>> invoked and does not set the pConvertor->pStack
>> as expected by r31496
>>
>> i will run some more tests from now
>>
>> Gilles
>>
>> On 2014/05/08 2:23, George Bosilca wrote:
>>> Strange. The outcome and the timing of this issue seem to highlight a link 
>>> with the other datatype-related issue you reported earlier and, as 
>>> suggested by Ralph, with Gilles' scif+vader issue.
>>>
>>> Generally speaking, the mechanism used to split the data in the case of 
>>> multiple BTLs, is identical to the one used to split the data in fragments. 
>>> So, if the culprit is in the splitting logic, one might see some weirdness 
>>> as soon as we force the exclusive usage of the send protocol, with an 
>>> unconventional fragment size.
>>>
>>> In other words using the following flags "--mca btl tcp,self --mca 
>>> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 
>>> 23 --mca btl_tcp_max_send_size 23" should always transfer wrong data, even 
>>> when only one single BTL is in play.
>>>
>>>   George.
>>>
>>> On May 7, 2014, at 13:11 , Rolf vandeVaart  wrote:
>>>
>>>> OK.  So, I investigated a little more.  I only see the issue when I am 
>>>> running with multiple ports enabled such that I have two openib BTLs 
>>>> instantiated.  In addition, large message RDMA has to be enabled.  If 
>>>> those conditions are not met, then I do not see the problem.  For example:
>>>> FAILS:
>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include 
>>>>   mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
>>>> PASS:
>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca 
>>>>   btl_openib_flags 3 MPI_Isend_ator_c
>>>>   mpirun -np 2 -host host1,host2 -mca 
>>>>   btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 
>>>>   MPI_Isend_ator_c
>>>>
>>>> So we must have some type of issue when we break up the message between 
>>>> the two openib BTLs.  Maybe someone else can confirm my observations?
>>>> I was testing against the latest trunk.

>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14766.php
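
The thread above points at two triggers for the same symptom: fragments that
are processed out of order, and messages that are split at unusual sizes or
across several BTLs. A minimal sketch of the underlying idea follows, assuming
a simple strided layout. It is not Open MPI's convertor code: strided_layout
and pack_at are hypothetical names invented for this illustration. The point
is only that a packing routine which recomputes its position from the packed
offset gives the same bytes regardless of the order in which fragments are
handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical strided layout (illustration only): `count` blocks of
 * `blocklen` bytes, block i starting at byte i * stride in the user buffer. */
typedef struct {
    size_t count;
    size_t blocklen;
    size_t stride;
} strided_layout;

/* Copy up to `len` packed bytes, starting at packed offset `pos`, from the
 * strided buffer `src` into `dst`. The source location is derived from `pos`
 * alone, so no state is carried over from a previously handled fragment.
 * Returns the number of bytes actually copied. */
static size_t pack_at(const strided_layout *lay, const unsigned char *src,
                      size_t pos, size_t len, unsigned char *dst)
{
    size_t total, done = 0;

    assert(lay != NULL && src != NULL && dst != NULL);
    if (lay->blocklen == 0) {
        return 0;
    }
    total = lay->count * lay->blocklen;
    if (pos >= total) {
        return 0;
    }
    if (len > total - pos) {
        len = total - pos;              /* clip the last fragment */
    }
    while (done < len) {
        size_t block    = (pos + done) / lay->blocklen;
        size_t in_block = (pos + done) % lay->blocklen;
        size_t chunk    = lay->blocklen - in_block;
        if (chunk > len - done) {
            chunk = len - done;
        }
        memcpy(dst + done, src + block * lay->stride + in_block, chunk);
        done += chunk;
    }
    return done;
}
```

Calling pack_at with a deliberately odd fragment size, and with the fragments
in reverse order, should produce the same packed buffer as one call that packs
everything at once. A routine that instead relies on a saved stack position
from the previous fragment would need that position to be reset correctly
whenever a fragment arrives out of sequence.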


Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework

2014-05-29 Thread Edgar Gabriel
sounds good to me too.

Edgar

On 5/29/2014 10:04 AM, Joshua Ladd wrote:
> +1 I'm interested in hearing more. RTE is of interest.
> 
> Josh
> 
> 
> On Thu, May 29, 2014 at 10:33 AM, Ralph Castain wrote:
> 
> +1 for me!
> 
> On May 29, 2014, at 7:26 AM, Thomas Naughton wrote:
> 
> > Hi,
> >
> > Thanks Jeff, I think that was a pretty good summary of things.
> >
> >> Thomas indicated there was no rush on the RFC; perhaps we can
> discuss this next-next-Tuesday (June 10)?
> >
> > Phone discussion seems like a good idea and June 10 sounds good to me.
> >
> > Thanks,
> > --tjn
> >
> >
> > _
> >  Thomas Naughton  naught...@ornl.gov
> >  Research Associate   (865) 576-4184
> >
> >
> > On Thu, 29 May 2014, Jeff Squyres (jsquyres) wrote:
> >
> >> I refrained from speaking up on this thread because I was on
> travel, and I wanted to think a bit more about this before I said
> anything.
> >>
> >> Let me try to summarize the arguments that have been made so far...
> >>
> >> A. Things people seem to agree on:
> >>
> >> 1. Inclusion in trunk has no correlation to being included in a
> release
> >> 2. Prior examples of (effectively) single-organization components
> >>
> >> B. Reasons to have STCI/HPX/etc. components in SVN trunk:
> >>
> >> 1. Multiple organizations are asking (ORNL, UTK, UH)
> >> 2. Easier to develop/merge the STCI/HPX/etc. components over time
> >> 3. Find all alternate RTE components in one place (vs. multiple
> internet repos)
> >> 4. More examples of how to use the RTE framework
> >>
> >> C. Reasons not to have STCI/HPX/etc. components in the SVN trunk:
> >>
> >> 1. What is the (technical) gain of being in the trunk?
> >> 2. Concerns about external release schedule pressure
> >> 3. Why have something on the trunk if it's not eventually
> destined for a release?
> >>
> >> In particular, I think B2 and C1 seem to be in conflict with each
> other.
> >>
> >> I have several thoughts about this topic, but I'm hesitant to
> continue this already lengthy thread on a contentious topic.  I also
> don't want to spend the next 30 minutes writing a lengthy,
> carefully-worded email that will just spawn further lengthy,
> carefully-worded emails (each costing 15-30 minutes).  Prior history
> has shown that we discuss and resolve issues much more rationally on
> the phone (vs. email hell).
> >>
> >> I would therefore like to discuss this on a weekly Tuesday call.
> >>
> >> Next week is bad because it's the MPI Forum meeting; I suspect
> that some -- but not all -- of us will not be on the Tuesday call
> because we'll be at the Forum.
> >>
> >> Thomas indicated there was no rush on the RFC; perhaps we can
> discuss this next-next-Tuesday (June 10)?
> >>
> >>
> >>
> >>
> >> On May 27, 2014, at 12:25 PM, Thomas Naughton wrote:
> >>
> >>>
> >>> WHAT:  add new component to ompi/rte framework
> >>>
> >>> WHY:   because it will simplify our maintenance & provide an
> alt. reference
> >>>
> >>> WHEN:  no rush, soon-ish? (June 12?)
> >>>
> >>> This is a component we currently maintain outside of the ompi
> tree to
> >>> support using OMPI with an alternate runtime system.  This will also
> >>> provide an alternate component to ORTE, which was motivation for PMI
> >>> component in related RFC.   We build/test nightly and it
> occasionally
> >>> catches ompi-rte abstraction violations, etc.
> >>>
> >>> Thomas
> >>>
> >>>
> >>> _
> >>> Thomas Naughton  naught...@ornl.gov
> >>> Research Associate   (865) 576-4184
> >>>
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org 
> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14852.php
> >>
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com 
> >> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org 
> >

[OMPI devel] Intermittent hangs when exiting with error

2014-05-29 Thread Rolf vandeVaart
Ralph:
I am seeing cases where mpirun seems to hang when one of the application 
processes exits with a non-zero status.  For example, the intel test 
MPI_Cart_get_c will exit that way if there are not enough processes to run the 
test.  In most cases, mpirun returns fine with the error code, but sometimes 
it just hangs.  I first started noticing this in my mtt runs.  It seems 
(though this is not conclusive) that I see this when both the usnic and openib 
BTLs are built, even though I am only using openib (as I have no usnic 
hardware).

Anyone else seeing something like this?  Note that I see this on both 1.8 and 
trunk, but I show trunk here.


PASS:
[rvandevaart@drossetti-ivy0 src]$ mpirun --mca btl self,sm,usnic,openib --host 
drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 --mca 
btl_openib_warn_default_gid_prefix 0 MPI_Cart_get_c
MPITEST skip (1): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST info  (0): Starting MPI_Cart_get  test
MPITEST skip (0): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (3): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (2): WARNING --  nodes =   4   Need   6 nodes to run test
---
Primary job  terminated normally, but 1 process returned a non-zero exit code.. 
Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero status, thus 
causing the job to be terminated. The first process to do so was:

  Process name: [[45854,1],1]
  Exit code:77
--

FAIL:
[rvandevaart@drossetti-ivy0 src]$ mpirun --mca btl self,sm,usnic,openib --host 
drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 --mca 
btl_openib_warn_default_gid_prefix 0 MPI_Cart_get_c
MPITEST skip (1): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST info  (0): Starting MPI_Cart_get  test
MPITEST skip (0): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (3): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (2): WARNING --  nodes =   4   Need   6 nodes to run test
---
Primary job  terminated normally, but 1 process returned a non-zero exit code.. 
Per user-direction, the job has been aborted.
---
[...now we are hung...]

LOCAL mpirun:
[rvandevaart@drossetti-ivy0 64-mtt-nocuda]$ pstack 27705
Thread 2 (Thread 0x7fe0c8c47700 (LWP 27706)):
#0  0x7fe0ca578533 in select () from /lib64/libc.so.6
#1  0x7fe0c8c5591e in listen_thread () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/openmpi/mca_oob_tcp.so
#2  0x7fe0ca831851 in start_thread () from /lib64/libpthread.so.0
#3  0x7fe0ca57f94d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fe0cbcdd700 (LWP 27705)):
#0  0x7fe0ca576293 in poll () from /lib64/libc.so.6
#1  0x7fe0cb589575 in poll_dispatch () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#2  0x7fe0cb57df8c in opal_libevent2021_event_base_loop () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#3  0x00405572 in orterun ()
#4  0x00403904 in main ()
[rvandevaart@drossetti-ivy0 64-mtt-nocuda]$

REMOTE ORTED:
[rvandevaart@drossetti-ivy1 ~]$ pstack 10241
#0  0x7fbdcba7c258 in poll () from /lib64/libc.so.6
#1  0x7fbdcca8f575 in poll_dispatch () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#2  0x7fbdcca83f8c in opal_libevent2021_event_base_loop () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#3  0x7fbdccd572cc in orte_daemon () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-rte.so.0
#4  0x0040094a in main ()
[rvandevaart@drossetti-ivy1 ~]$
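
The report turns on a launcher noticing that a child exited with a non-zero
code (77 in the output above) and then returning that code itself. As a rough
illustration only, and not a description of how mpirun or orted are
implemented, the basic POSIX mechanism for collecting a child's exit code
might be sketched as below. run_child_and_reap is a hypothetical helper
written for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately with `code`, then reap it.
 * Returns the child's exit code, or -1 if fork/waitpid failed or the child
 * did not terminate through a normal exit. */
static int run_child_and_reap(int code)
{
    int status = 0;
    pid_t pid, r;

    assert(code >= 0 && code <= 255);   /* exit codes are 8 bits wide */

    pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        _exit(code);                    /* child: exit without flushing stdio */
    }
    do {
        r = waitpid(pid, &status, 0);   /* parent: block until the child ends */
    } while (r < 0 && errno == EINTR);  /* retry if interrupted by a signal */
    if (r != pid) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the stack traces above both mpirun and the remote orted are sitting in
their event loops, which is consistent with a termination event that was
never delivered or never acted upon, although the traces alone do not show
which of the two it is.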




Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework

2014-05-29 Thread Joshua Ladd
+1 I'm interested in hearing more. RTE is of interest.

Josh


On Thu, May 29, 2014 at 10:33 AM, Ralph Castain  wrote:

> +1 for me!
>
> On May 29, 2014, at 7:26 AM, Thomas Naughton  wrote:
>
> > Hi,
> >
> > Thanks Jeff, I think that was a pretty good summary of things.
> >
> >> Thomas indicated there was no rush on the RFC; perhaps we can discuss
> this next-next-Tuesday (June 10)?
> >
> > Phone discussion seems like a good idea and June 10 sounds good to me.
> >
> > Thanks,
> > --tjn
> >
> > _
> >  Thomas Naughton  naught...@ornl.gov
> >  Research Associate   (865) 576-4184
> >
> >
> > On Thu, 29 May 2014, Jeff Squyres (jsquyres) wrote:
> >
> >> I refrained from speaking up on this thread because I was on travel,
> and I wanted to think a bit more about this before I said anything.
> >>
> >> Let me try to summarize the arguments that have been made so far...
> >>
> >> A. Things people seem to agree on:
> >>
> >> 1. Inclusion in trunk has no correlation to being included in a release
> >> 2. Prior examples of (effectively) single-organization components
> >>
> >> B. Reasons to have STCI/HPX/etc. components in SVN trunk:
> >>
> >> 1. Multiple organizations are asking (ORNL, UTK, UH)
> >> 2. Easier to develop/merge the STCI/HPX/etc. components over time
> >> 3. Find all alternate RTE components in one place (vs. multiple
> internet repos)
> >> 4. More examples of how to use the RTE framework
> >>
> >> C. Reasons not to have STCI/HPX/etc. components in the SVN trunk:
> >>
> >> 1. What is the (technical) gain of being in the trunk?
> >> 2. Concerns about external release schedule pressure
> >> 3. Why have something on the trunk if it's not eventually destined for
> a release?
> >>
> >> In particular, I think B2 and C1 seem to be in conflict with each other.
> >>
> >> I have several thoughts about this topic, but I'm hesitant to continue
> this already lengthy thread on a contentious topic.  I also don't want to
> spend the next 30 minutes writing a lengthy, carefully-worded email that
> will just spawn further lengthy, carefully-worded emails (each costing
> 15-30 minutes).  Prior history has shown that we discuss and resolve issues
> much more rationally on the phone (vs. email hell).
> >>
> >> I would therefore like to discuss this on a weekly Tuesday call.
> >>
> >> Next week is bad because it's the MPI Forum meeting; I suspect that
> some -- but not all -- of us will not be on the Tuesday call because we'll
> be at the Forum.
> >>
> >> Thomas indicated there was no rush on the RFC; perhaps we can discuss
> this next-next-Tuesday (June 10)?
> >>
> >>
> >>
> >>
> >> On May 27, 2014, at 12:25 PM, Thomas Naughton 
> wrote:
> >>
> >>>
> >>> WHAT:  add new component to ompi/rte framework
> >>>
> >>> WHY:   because it will simplify our maintenance & provide an alt.
> reference
> >>>
> >>> WHEN:  no rush, soon-ish? (June 12?)
> >>>
> >>> This is a component we currently maintain outside of the ompi tree to
> >>> support using OMPI with an alternate runtime system.  This will also
> >>> provide an alternate component to ORTE, which was motivation for PMI
> >>> component in related RFC.   We build/test nightly and it occasionally
> >>> catches ompi-rte abstraction violations, etc.
> >>>
> >>> Thomas
> >>>
> >>>
> >>> _
> >>> Thomas Naughton  naught...@ornl.gov
> >>> Research Associate   (865) 576-4184
> >>>
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14852.php
> >>
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com
> >> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14904.php
> >>
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14905.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14906.php
>


Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework

2014-05-29 Thread Ralph Castain
+1 for me!

On May 29, 2014, at 7:26 AM, Thomas Naughton  wrote:

> Hi,
> 
> Thanks Jeff, I think that was a pretty good summary of things.
> 
>> Thomas indicated there was no rush on the RFC; perhaps we can discuss this 
>> next-next-Tuesday (June 10)?
> 
> Phone discussion seems like a good idea and June 10 sounds good to me.
> 
> Thanks,
> --tjn
> 
> _
>  Thomas Naughton  naught...@ornl.gov
>  Research Associate   (865) 576-4184
> 
> 
> On Thu, 29 May 2014, Jeff Squyres (jsquyres) wrote:
> 
>> I refrained from speaking up on this thread because I was on travel, and I 
>> wanted to think a bit more about this before I said anything.
>> 
>> Let me try to summarize the arguments that have been made so far...
>> 
>> A. Things people seem to agree on:
>> 
>> 1. Inclusion in trunk has no correlation to being included in a release
>> 2. Prior examples of (effectively) single-organization components
>> 
>> B. Reasons to have STCI/HPX/etc. components in SVN trunk:
>> 
>> 1. Multiple organizations are asking (ORNL, UTK, UH)
>> 2. Easier to develop/merge the STCI/HPX/etc. components over time
>> 3. Find all alternate RTE components in one place (vs. multiple internet 
>> repos)
>> 4. More examples of how to use the RTE framework
>> 
>> C. Reasons not to have STCI/HPX/etc. components in the SVN trunk:
>> 
>> 1. What is the (technical) gain of being in the trunk?
>> 2. Concerns about external release schedule pressure
>> 3. Why have something on the trunk if it's not eventually destined for a 
>> release?
>> 
>> In particular, I think B2 and C1 seem to be in conflict with each other.
>> 
>> I have several thoughts about this topic, but I'm hesitant to continue this 
>> already lengthy thread on a contentious topic.  I also don't want to spend 
>> the next 30 minutes writing a lengthy, carefully-worded email that will just 
>> spawn further lengthy, carefully-worded emails (each costing 15-30 minutes). 
>>  Prior history has shown that we discuss and resolve issues much more 
>> rationally on the phone (vs. email hell).
>> 
>> I would therefore like to discuss this on a weekly Tuesday call.
>> 
>> Next week is bad because it's the MPI Forum meeting; I suspect that some -- 
>> but not all -- of us will not be on the Tuesday call because we'll be at the 
>> Forum.
>> 
>> Thomas indicated there was no rush on the RFC; perhaps we can discuss this 
>> next-next-Tuesday (June 10)?
>> 
>> 
>> 
>> 
>> On May 27, 2014, at 12:25 PM, Thomas Naughton  wrote:
>> 
>>> 
>>> WHAT:  add new component to ompi/rte framework
>>> 
>>> WHY:   because it will simplify our maintenance & provide an alt. reference
>>> 
>>> WHEN:  no rush, soon-ish? (June 12?)
>>> 
>>> This is a component we currently maintain outside of the ompi tree to
>>> support using OMPI with an alternate runtime system.  This will also
>>> provide an alternate component to ORTE, which was motivation for PMI
>>> component in related RFC.   We build/test nightly and it occasionally
>>> catches ompi-rte abstraction violations, etc.
>>> 
>>> Thomas
>>> 
>>> _
>>> Thomas Naughton  naught...@ornl.gov
>>> Research Associate   (865) 576-4184
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/05/14852.php
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: 
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/05/14904.php
>> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14905.php



Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework

2014-05-29 Thread Thomas Naughton

Hi,

Thanks Jeff, I think that was a pretty good summary of things.

> Thomas indicated there was no rush on the RFC; perhaps we can 
> discuss this next-next-Tuesday (June 10)?


Phone discussion seems like a good idea and June 10 sounds good to me.

Thanks,
--tjn

 _
  Thomas Naughton  naught...@ornl.gov
  Research Associate   (865) 576-4184


On Thu, 29 May 2014, Jeff Squyres (jsquyres) wrote:


I refrained from speaking up on this thread because I was on travel, and I 
wanted to think a bit more about this before I said anything.

Let me try to summarize the arguments that have been made so far...

A. Things people seem to agree on:

1. Inclusion in trunk has no correlation to being included in a release
2. Prior examples of (effectively) single-organization components

B. Reasons to have STCI/HPX/etc. components in SVN trunk:

1. Multiple organizations are asking (ORNL, UTK, UH)
2. Easier to develop/merge the STCI/HPX/etc. components over time
3. Find all alternate RTE components in one place (vs. multiple internet repos)
4. More examples of how to use the RTE framework

C. Reasons not to have STCI/HPX/etc. components in the SVN trunk:

1. What is the (technical) gain of being in the trunk?
2. Concerns about external release schedule pressure
3. Why have something on the trunk if it's not eventually destined for a 
release?

In particular, I think B2 and C1 seem to be in conflict with each other.

I have several thoughts about this topic, but I'm hesitant to continue this 
already lengthy thread on a contentious topic.  I also don't want to spend the 
next 30 minutes writing a lengthy, carefully-worded email that will just spawn 
further lengthy, carefully-worded emails (each costing 15-30 minutes).  Prior 
history has shown that we discuss and resolve issues much more rationally on 
the phone (vs. email hell).

I would therefore like to discuss this on a weekly Tuesday call.

Next week is bad because it's the MPI Forum meeting; I suspect that some -- but 
not all -- of us will not be on the Tuesday call because we'll be at the Forum.

Thomas indicated there was no rush on the RFC; perhaps we can discuss this 
next-next-Tuesday (June 10)?




On May 27, 2014, at 12:25 PM, Thomas Naughton  wrote:



WHAT:  add new component to ompi/rte framework

WHY:   because it will simplify our maintenance & provide an alt. reference

WHEN:  no rush, soon-ish? (June 12?)

This is a component we currently maintain outside of the ompi tree to
support using OMPI with an alternate runtime system.  This will also
provide an alternate component to ORTE, which was motivation for PMI
component in related RFC.   We build/test nightly and it occasionally
catches ompi-rte abstraction violations, etc.

Thomas

_
 Thomas Naughton  naught...@ornl.gov
 Research Associate   (865) 576-4184

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/05/14852.php



--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/05/14904.php



Re: [OMPI devel] RFC: add STCI component to OMPI/RTE framework

2014-05-29 Thread Jeff Squyres (jsquyres)
I refrained from speaking up on this thread because I was on travel, and I 
wanted to think a bit more about this before I said anything.

Let me try to summarize the arguments that have been made so far...

A. Things people seem to agree on:

1. Inclusion in trunk has no correlation to being included in a release
2. Prior examples of (effectively) single-organization components

B. Reasons to have STCI/HPX/etc. components in SVN trunk:

1. Multiple organizations are asking (ORNL, UTK, UH)
2. Easier to develop/merge the STCI/HPX/etc. components over time
3. Find all alternate RTE components in one place (vs. multiple internet repos)
4. More examples of how to use the RTE framework

C. Reasons not to have STCI/HPX/etc. components in the SVN trunk:

1. What is the (technical) gain of being in the trunk?
2. Concerns about external release schedule pressure
3. Why have something on the trunk if it's not eventually destined for a 
release?

In particular, I think B2 and C1 seem to be in conflict with each other.

I have several thoughts about this topic, but I'm hesitant to continue this 
already lengthy thread on a contentious topic.  I also don't want to spend the 
next 30 minutes writing a lengthy, carefully-worded email that will just spawn 
further lengthy, carefully-worded emails (each costing 15-30 minutes).  Prior 
history has shown that we discuss and resolve issues much more rationally on 
the phone (vs. email hell).

I would therefore like to discuss this on a weekly Tuesday call.

Next week is bad because it's the MPI Forum meeting; I suspect that some -- but 
not all -- of us will not be on the Tuesday call because we'll be at the Forum.

Thomas indicated there was no rush on the RFC; perhaps we can discuss this 
next-next-Tuesday (June 10)?




On May 27, 2014, at 12:25 PM, Thomas Naughton  wrote:

> 
> WHAT:  add new component to ompi/rte framework
> 
> WHY:   because it will simplify our maintenance & provide an alt. reference
> 
> WHEN:  no rush, soon-ish? (June 12?)
> 
> This is a component we currently maintain outside of the ompi tree to
> support using OMPI with an alternate runtime system.  This will also
> provide an alternate component to ORTE, which was motivation for PMI
> component in related RFC.   We build/test nightly and it occasionally
> catches ompi-rte abstraction violations, etc.
> 
> Thomas
> 
> _
>  Thomas Naughton  naught...@ornl.gov
>  Research Associate   (865) 576-4184
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14852.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Trunk (RDMA and VT) warnings

2014-05-29 Thread Ralph Castain
Not seeing it on fresh checkout of today's trunk head, so something may have 
resolved it since last test


On May 28, 2014, at 10:27 PM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> 
> On Wed, May 28, 2014 at 9:53 PM, Ralph Castain  wrote:
> gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
>  ./configure --prefix=/home/common/openmpi/build/svn-trunk --enable-mpi-java 
> --enable-orterun-prefix-by-default
> 
> More inline below
> 
> 
> this looks like an up-to-date CentOS box.
> i am unable to reproduce the warnings (may be uninitialized in this function) 
> with a similar box :-(
> 
>  
> On May 27, 2014, at 9:29 PM, Gilles Gouaillardet 
>  wrote:
> so far, it seems this is a false positive/compiler bug that could be 
> triggered by inlining
>> /* i could not find any path in which the variable is used uninitialized */
> 
> I just glanced at the first one (line 221 of osc_rdma_data_move.c), and I can 
> see what the compiler is complaining about - have gotten this in other places 
> as well. The problem is that you pass the address of ptr into a function 
> without first initializing the value of ptr itself. There is no guarantee (so 
> far as the compiler can see) that this function will in fact set the value of 
> ptr - you are relying solely on the fact that (a) you checked that function 
> at one point in time and saw that it always gets set to something if ret == 
> OMPI_SUCCESS, and (b) nobody changed that function since you checked.
> 
> Newer compilers seem to be getting more defensive about such things and 
> starting to "bark" when they see it. I think you are correct that inlining 
> also impacts that situation, though I've also been seeing it when the 
> functions aren't inlined.
> 
> 
> i wrote the simple test program :
> 
> #include <string.h>
> 
> char * mystring = "hello";
> static inline int setif(int mustset, char **ptr) {
>     if (!mustset) {
>         return 1;
>     }
>     *ptr = mystring;
>     return 0;
> }
> 
> void good(int mustset) {
>     char * ptr;
>     char buf[256];
>     if (setif(mustset, &ptr) == 0) {
>         memcpy(buf, ptr, 6);
>     }
> }
> 
> void bad(int mustset) {
>     char * ptr;
>     char buf[256];
>     if (setif(mustset, &ptr) != 0) {
>         memcpy(buf, ptr, 6);
>     }
> }
> 
> please note that :
> - the setif function is declared 'inline'
> - the setif will set *ptr only if the 'mustset' parameter is nonzero and then 
> return 0
> - the setif will leave *ptr unmodified if the 'mustset' parameter is zero and 
> then return 1
> 
> it is trivial that the 'good' function is ok whereas the 'bad' function has 
> an issue :
> the compiler has a way to figure out that ptr will be uninitialized when 
> invoking memcpy
> (since setif returned a non zero status)
> 
> gcc -Wall -O0 test.c
> does not complain
> 
> gcc -Wall -O1 test.c *does* complain
> test.c:24: warning: ‘ptr’ may be used uninitialized in this function
> 
> if the 'inline' keyword is omitted, -O2 is needed to get a compiler warning.
> 
> bottom line, an optimized build (-O3 -finline-functions) correctly issues a 
> warning.
> i checked osc_rdma_data_move.c and osc_rdma_frag.h again and again and i 
> could not find how ptr can be uninitialized in ompi_osc_rdma_control_send if
> ompi_osc_rdma_frag_alloc returned OMPI_SUCCESS
> /* not to mention i am unable to reproduce the warning */
> 
> about the compiler getting more defensive :
> 
> { int rank;
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   rank++;
> }
> 
> i never saw a compiler issue a warning about rank that could be used 
> uninitialized
> 
>  
> Not sure what to suggest here - hate to add initialization steps in that 
> sequence
> 
> me too, and i do not see any warnings from the compiler
> 
> can you please confirm you can reproduce the issue on the most up to date 
> trunk revision, on an x86_64 box (you never know ...)?
> then can you send the output of
> 
> cd ompi/mca/osc/rdma
> touch osc_rdma_data_move.c
> make -n osc_rdma_data_move.lo
> 
> 
> Cheers,
> 
> Gilles 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14902.php



Re: [OMPI devel] Trunk (RDMA and VT) warnings

2014-05-29 Thread Gilles Gouaillardet
Ralph,


On Wed, May 28, 2014 at 9:53 PM, Ralph Castain  wrote:

> gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
>  ./configure --prefix=/home/common/openmpi/build/svn-trunk
> --enable-mpi-java --enable-orterun-prefix-by-default
>
> More inline below
>
>
this looks like an up-to-date CentOS box.
i am unable to reproduce the warnings (may be uninitialized in this
function) with a similar box :-(



> On May 27, 2014, at 9:29 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> so far, it seems this is a false positive/compiler bug that could be
> triggered by inlining
>
> /* i could not find any path in which the variable is used uninitialized */
>
>
> I just glanced at the first one (line 221 of osc_rdma_data_move.c), and I
> can see what the compiler is complaining about - have gotten this in other
> places as well. The problem is that you pass the address of ptr into a
> function without first initializing the value of ptr itself. There is no
> guarantee (so far as the compiler can see) that this function will in fact
> set the value of ptr - you are relying solely on the fact that (a) you
> checked that function at one point in time and saw that it always gets set
> to something if ret == OMPI_SUCCESS, and (b) nobody changed that function
> since you checked.
>
> Newer compilers seem to be getting more defensive about such things and
> starting to "bark" when they see it. I think you are correct that inlining
> also impacts that situation, though I've also been seeing it when the
> functions aren't inlined.
>
>
i wrote the simple test program :

#include <string.h>

char * mystring = "hello";
static inline int setif(int mustset, char **ptr) {
    if (!mustset) {
        return 1;            /* *ptr is left untouched */
    }
    *ptr = mystring;
    return 0;                /* *ptr has been set */
}

void good(int mustset) {
    char * ptr;
    char buf[256];
    if (setif(mustset, &ptr) == 0) {
        memcpy(buf, ptr, 6); /* ptr was set by setif */
    }
}

void bad(int mustset) {
    char * ptr;
    char buf[256];
    if (setif(mustset, &ptr) != 0) {
        memcpy(buf, ptr, 6); /* ptr was NOT set: this is what gcc flags */
    }
}

please note that :
- the setif function is declared 'inline'
- the setif will set *ptr only if the 'mustset' parameter is nonzero and
then return 0
- the setif will leave *ptr unmodified if the 'mustset' parameter is zero
and then return 1

it is trivial that the 'good' function is ok whereas the 'bad' function has
an issue :
the compiler has a way to figure out that ptr will be uninitialized when
invoking memcpy
(since setif returned a non zero status)

gcc -Wall -O0 test.c
does not complain

gcc -Wall -O1 test.c *does* complain
test.c:24: warning: ‘ptr’ may be used uninitialized in this function

if the 'inline' keyword is omitted, -O2 is needed to get a compiler warning.

bottom line, an optimized build (-O3 -finline-functions) correctly issues a
warning.
i checked osc_rdma_data_move.c and osc_rdma_frag.h again and again and i
could not find how ptr can be uninitialized in ompi_osc_rdma_control_send if
ompi_osc_rdma_frag_alloc returned OMPI_SUCCESS
/* not to mention i am unable to reproduce the warning */

about the compiler getting more defensive :

{ int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  rank++;
}

i never saw a compiler issue a warning about rank that could be used
uninitialized



> Not sure what to suggest here - hate to add initialization steps in that
> sequence
>
me too, and i do not see any warnings from the compiler

can you please confirm you can reproduce the issue on the most up to date
trunk revision, on an x86_64 box (you never know ...)?
then can you send the output of

cd ompi/mca/osc/rdma
touch osc_rdma_data_move.c
make -n osc_rdma_data_move.lo


Cheers,

Gilles
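
For reference, the defensive variant that both posters were reluctant to adopt
(initializing the pointer before handing its address to the callee) might look
like the sketch below. It reuses the setif contract from the test program
above. copy_if_set is a hypothetical helper added for this illustration, and
nothing here is taken from osc_rdma_data_move.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *mystring = "hello";

/* Same contract as setif() in the message above: *ptr is written, and 0 is
 * returned, only when mustset is nonzero. */
static int setif(int mustset, const char **ptr)
{
    if (!mustset) {
        return 1;
    }
    *ptr = mystring;
    return 0;
}

/* Copy the string (including its terminator) into buf when it is available.
 * Returns the number of bytes copied, or 0 if nothing was copied. */
static size_t copy_if_set(int mustset, char *buf, size_t buflen)
{
    const char *ptr = NULL;  /* explicit initialization: every path is defined */
    size_t n;

    assert(buf != NULL);
    if (setif(mustset, &ptr) != 0 || ptr == NULL) {
        return 0;            /* callee reported failure or left ptr unset */
    }
    n = strlen(ptr) + 1;
    if (n > buflen) {
        return 0;            /* not enough room in the destination */
    }
    memcpy(buf, ptr, n);
    return n;
}
```

The cost is one extra store and one extra comparison per call. In exchange,
the code no longer depends on the callee's contract staying unchanged, and the
compiler has nothing left to flag at any optimization level.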