Re: [OMPI devel] Locality info

2011-10-19 Thread Ralph Castain

On Oct 19, 2011, at 5:05 PM, George Bosilca wrote:

> Wonderful!!! We've been waiting for such functionality for a while.

My pleasure :-)

> 
> I do have some questions/remarks related to this patch.
> 
> What is the my_node_rank in the orte_proc_info_t structure?

The node rank is a local ranking of procs on a node, starting with 0 for the 
lowest vpid on the node and going up from there. It was normally passed in the 
environment and picked up in the ess components so it could be used to select a 
static port during oob init, if static ports were specified.

I moved it to a more general place solely because I wanted to move a bunch of 
replicated code to the ess/base instead of having it in nearly every module. I 
debated about putting it in ess/base.h instead, but since other places in the 
code might also want it, figured I'd make it more globally available.

If it turns out nobody needs it, we can move it back into just the ess.

> Is there any difference between using the field my_node_rank or the vpid part 
> of the my_daemon?

Yes - my_daemon refers to the local daemon. The node rank refers solely to the 
relative ranking of application procs on the node.

> What is the correct way of finding that two processes are on the same remote 
> location, comparing their daemon vpid or their node_rank?

Daemon vpid
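
To make that concrete, here is a minimal sketch of the comparison. The 
struct and function names are hypothetical stand-ins, not the actual ORTE 
types; the only point is that co-location is decided by the hosting daemon, 
while node_rank only orders procs within one node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified per-process record (not the real ORTE structure). */
typedef struct {
    uint32_t daemon_vpid;  /* vpid of the daemon hosting this proc */
    uint32_t node_rank;    /* rank among application procs on that node */
} sketch_proc_t;

/* Two procs are on the same node when the same daemon hosts them.
 * node_rank is deliberately ignored: rank 0 exists on every node. */
bool sketch_same_node(const sketch_proc_t *a, const sketch_proc_t *b)
{
    assert(a != NULL && b != NULL);
    return a->daemon_vpid == b->daemon_vpid;
}
```

Two procs with node_rank 0 under different daemons are therefore on 
different nodes, which is why node_rank alone cannot answer the question.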

> How does the node_rank change with respect to dynamic process management 
> when new daemons are joining?

This is where node_rank comes into play. The mapper sees across jobs that are 
sharing nodes, so the mapper currently is responsible for computing the 
node_rank of a proc. This info gets transmitted to all daemons, including new 
dynamically started ones, in the launch msg. So everyone always has a picture 
of the node_rank for every proc.
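
A rough sketch of the idea, under stated assumptions: the per-node counter, 
the fixed node limit, and every name below are made up for illustration and 
are not the mapper's real code. It assumes procs are handed over in ascending 
vpid order and that the counter state persists across jobs, so a later job 
sharing a node continues the numbering.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 8  /* illustrative limit for this sketch only */

/* Hypothetical mapped-process record. */
typedef struct {
    uint32_t vpid;
    uint32_t node;       /* index of the node hosting the proc */
    uint32_t node_rank;  /* filled in by sketch_assign_node_ranks() */
} sketch_mapped_proc_t;

/* State kept across jobs: next unused node_rank on each node. */
typedef struct {
    uint32_t next_rank[SKETCH_MAX_NODES];
} sketch_rank_state_t;

/* Assign node ranks for one job; procs must be sorted by ascending vpid,
 * so the lowest vpid on a node gets the lowest rank still free there. */
void sketch_assign_node_ranks(sketch_rank_state_t *state,
                              sketch_mapped_proc_t *procs, size_t n)
{
    assert(state != NULL && procs != NULL);
    for (size_t i = 0; i < n; i++) {
        assert(procs[i].node < SKETCH_MAX_NODES);
        procs[i].node_rank = state->next_rank[procs[i].node]++;
    }
}
```

Calling it once per job with the same state gives the first job ranks 0..k-1 
on a node and a dynamically spawned job ranks from k onward on that node.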

> 
> The flag OPAL_PROC_ON_L*CACHE is only set for local processes, if I 
> understand your last email correctly?

Yes - all the locality flags refer only to the location of another process 
relative to you, you being an app process. As I said, though, this can easily 
be extended to return the relative locality of two procs on a remote node, if 
that would be of use.
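
For anyone who wants to picture how such a bitmask is consumed, here is a 
small sketch. The flag names and bit positions are invented for the example 
(they are not the OPAL_PROC_ON_* definitions); it only shows the usual 
test-a-bit pattern and the "unbound on the same node shares everything" case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical locality type and flags; real values live in the OPAL headers. */
typedef uint16_t sketch_locality_t;

enum {
    SKETCH_ON_NODE     = 1u << 0,
    SKETCH_ON_NUMA     = 1u << 1,
    SKETCH_ON_SOCKET   = 1u << 2,
    SKETCH_ON_L3CACHE  = 1u << 3,
    SKETCH_ON_L2CACHE  = 1u << 4,
    SKETCH_ON_L1CACHE  = 1u << 5,
    SKETCH_ON_CORE     = 1u << 6,
    SKETCH_ON_HWTHREAD = 1u << 7,
    SKETCH_ON_ALL      = 0xFF   /* every level shared */
};

/* True when the peer shares the given resource level with the caller. */
bool sketch_shares(sketch_locality_t locality, sketch_locality_t level)
{
    assert(level != 0);
    return (locality & level) == level;
}

/* An unbound peer on the same node shares every level with the caller. */
sketch_locality_t sketch_unbound_same_node(void)
{
    return SKETCH_ON_ALL;
}
```

A caller would typically test the narrowest level it cares about first, for 
example the L1 flag before falling back to socket or node.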

> 
> I guess proc_flags in proc.h should be opal_paffinity_locality_t to match the 
> flags on the ORTE level?

My bad - I thought I had changed it? If not, it certainly needs to be...

> 
> A more high-level remark. The fact that the locality information is 
> automatically packed and exchanged during the grpcomm modex call seems a 
> little bit weird (do the upper levels have a say in it?). I would not have 
> thought that the grpcomm (which, based on the grpcomm.h header file, is a 
> framework providing communication services that span entire jobs or 
> collections of processes) is the place to put it.

I agree - I wasn't entirely sure where to put it, frankly. It needs to be 
somewhere that both direct launch and mpirun-launched apps can see it. Could go 
in the MPI layer, I suppose.

Suggestions welcome!


> 
> Thanks,
>  george.
> 
> 
> On Oct 19, 2011, at 16:28 , Ralph Castain wrote:
> 
>> Hi folks
>> 
>> For those of you who don't follow the commits...
>> 
>> I just committed (r25323) an extension of the orte_ess.proc_get_locality 
>> function that allows a process to get its relative resource usage with any 
>> other proc in the job. In other words, you can provide a process name to the 
>> function, and the returned bitmask tells you if you share a node, numa, 
>> socket, caches (by level), core, and hyperthread with that process.
>> 
>> If you are on the same node and unbound, of course, you share all of those. 
>> However, if you are bound, then this can help tell you if you are on a 
>> common numa node, sharing an L1 cache, etc. Might be handy.
>> 
>> I implemented the underlying functionality so that we can further extend it 
>> to tell you the relative resource location of two procs on a remote node. If 
>> that someday becomes of interest, it would be relatively easy to do - but 
>> would require passing more info around. Hence, I've allowed for it, but not 
>> implemented it until there is some identified need.
>> 
>> Locality info is available anytime after the modex is completed during 
>> MPI_Init, and is supported regardless of launch environment (minus cnos, for 
>> now), launch by mpirun, or direct-launch - in other words, pretty much 
>> always.
>> 
>> Hope it proves of help in your work
>> Ralph
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Locality info

2011-10-19 Thread George Bosilca
Wonderful!!! We've been waiting for such functionality for a while.

I do have some questions/remarks related to this patch.

What is the my_node_rank in the orte_proc_info_t structure? Is there any 
difference between using the field my_node_rank or the vpid part of the 
my_daemon? What is the correct way of finding that two processes are on the 
same remote location, comparing their daemon vpid or their node_rank? How does 
the node_rank change with respect to dynamic process management when new 
daemons are joining?

The flag OPAL_PROC_ON_L*CACHE is only set for local processes, if I understand 
your last email correctly?

I guess proc_flags in proc.h should be opal_paffinity_locality_t to match the 
flags on the ORTE level?

A more high-level remark. The fact that the locality information is 
automatically packed and exchanged during the grpcomm modex call seems a little 
bit weird (do the upper levels have a say in it?). I would not have thought 
that the grpcomm (which, based on the grpcomm.h header file, is a framework 
providing communication services that span entire jobs or collections of 
processes) is the place to put it.

Thanks,
  george.


On Oct 19, 2011, at 16:28 , Ralph Castain wrote:

> Hi folks
> 
> For those of you who don't follow the commits...
> 
> I just committed (r25323) an extension of the orte_ess.proc_get_locality 
> function that allows a process to get its relative resource usage with any 
> other proc in the job. In other words, you can provide a process name to the 
> function, and the returned bitmask tells you if you share a node, numa, 
> socket, caches (by level), core, and hyperthread with that process.
> 
> If you are on the same node and unbound, of course, you share all of those. 
> However, if you are bound, then this can help tell you if you are on a common 
> numa node, sharing an L1 cache, etc. Might be handy.
> 
> I implemented the underlying functionality so that we can further extend it 
> to tell you the relative resource location of two procs on a remote node. If 
> that someday becomes of interest, it would be relatively easy to do - but 
> would require passing more info around. Hence, I've allowed for it, but not 
> implemented it until there is some identified need.
> 
> Locality info is available anytime after the modex is completed during 
> MPI_Init, and is supported regardless of launch environment (minus cnos, for 
> now), launch by mpirun, or direct-launch - in other words, pretty much always.
> 
> Hope it proves of help in your work
> Ralph




Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread George Bosilca
There are several OPAL level error codes not used in the current code.

OPAL_ERR_TOPO_SLOT_LIST_NOT_SUPPORTED
OPAL_ERR_TOPO_SOCKET_NOT_SUPPORTED
OPAL_ERR_TOPO_CORE_NOT_SUPPORTED
OPAL_ERR_NOT_ENOUGH_SOCKETS
OPAL_ERR_NOT_ENOUGH_CORES
OPAL_ERR_INVALID_PHYS_CPU
OPAL_ERR_MULTIPLE_AFFINITIES

If somebody feels like filing an RFC to remove them, please feel free to go 
ahead.

  george.

On Oct 19, 2011, at 18:41 , George Bosilca wrote:

> A careful reading of the committed patch would have shown that none of the 
> concerns raised so far were true: the "old-way" behavior of the OMPI code 
> was preserved. Moreover, every single one of the removed error codes had 
> not been used in ages.
> 
> What Brian pointed out as evil, evil being a subjective notion in itself, 
> didn't prevent the correct behavior of the code, nor did it affect its 
> correctness in any way. Anyway, to address his concern I pushed a patch 
> (25333) putting the OMPI error codes back where they were originally.
> 
> In other words, we spent a very unproductive day arguing over unfounded 
> claims and "thought-to-be" behaviors.
> 
>  george.
> 
> 
> On Oct 19, 2011, at 17:50 , Barrett, Brian W wrote:
> 
>> George -
>> 
>> I wrote the error code gorp; I'm pretty sure I know exactly how it was
>> supposed to work.
>> 
>> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
>> OPAL_ERR_MAX.  I now see what you did with ERR_REQUEST, and it's evil.
>> That's not the intent of the error code logic at all.  If you want to
>> change that, I'm not necessarily opposed to it, but that's something that
>> should be discussed in an RFC.  What the current code does is not
>> consistent with the original intent.
>> 
>> I don't agree that you shouldn't propagate error codes through OMPI; in
>> fact, the original intent of the design was to allow such propagation.
>> Again, such a change should be discussed as part of an RFC.
>> 
>> Brian
>> 
>> On 10/19/11 4:50 PM, "George Bosilca"  wrote:
>> 
>>> I don't know how you think that the error codes work in Open MPI, so I'll
>>> take the liberty to depict it here so we all agree we're talking about
>>> the same thing.
>>> 
>>> The opal_strerror is a nice feature, it allow to register a range of
>>> error codes with a particular error converter. Every time you look for
>>> the meaning of a particular error code, the first convertor with a range
>>> enveloping the looked at value, will translate it into an error string.
>>> 
>>> This is only currently used by OPAL and ORTE directly. It worked at the
>>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>>> ones. This behavior didn't change after my patch, you can still use
>>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>>> 
>>> There is a small "variation" for OMPI_ERR_REQUEST, the only really OMPI
>>> specific error code today. The OMPI error codes are actually inserted
>>> between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>>> there is __no__ possible overlap right now. If at one point we add more
>>> than 100 OMPI level, we should certainly revisit this.
>>> 
>>> Now, resulting from my patch, there is a difference. One should not
>>> simply forward an ORTE code into the stack of OMPI, and hope it just
>>> works. Errors should be dealt with where they happens, and if not
>>> possible they should be translated into the actual layer error code. The
>>> error propagation should be compartmentalized, and has to be translated
>>> into an error code that has a meaning at the OMPI level. The current
>>> patch should not prevent the mixed error-code code to work, as
>>> opal_strerror retains the same behavior as before. However, this coding
>>> practice should be avoided. I tried to clean the current code of such
>>> instances few days ago in r25230.
>>> 
>>> Moreover, this is similar to how we deal with the error codes between
>>> OMPI and MPI layers, and seems like a sane way to compose libraries. You
>>> deal with a specific layer error code when you get it (usually after the
>>> call to a function from that specific layer), not later on when you don't
>>> even know exactly what the execution path was.
>>> 
>>> george.
>>> 
>>> PS: I'll fix the +/- issue.
>>> 
>>> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>>> 
>>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>>> and -).
>>>> 
>>>> For one thing, that breaks opal_strerror().  That, in itself, seems
>>>> like a dealbreaker.
>>>> 
>>>> 
>>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>>> 
>>>>> I actually think it's worse than that.  An ORTE error code can now have
>>>>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>>> Or, they should, if George hadn't made a mistake (see below).  The
>>>>> sharing
> 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread George Bosilca
A careful reading of the committed patch would have shown that none of the 
concerns raised so far were true: the "old-way" behavior of the OMPI code was 
preserved. Moreover, every single one of the removed error codes had not been 
used in ages.

What Brian pointed out as evil, evil being a subjective notion in itself, 
didn't prevent the correct behavior of the code, nor did it affect its 
correctness in any way. Anyway, to address his concern I pushed a patch (25333) 
putting the OMPI error codes back where they were originally.

In other words, we spent a very unproductive day arguing over unfounded 
claims and "thought-to-be" behaviors.

  george.


On Oct 19, 2011, at 17:50 , Barrett, Brian W wrote:

> George -
> 
> I wrote the error code gorp; I'm pretty sure I know exactly how it was
> supposed to work.
> 
> There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
> OPAL_ERR_MAX.  I now see what you did with ERR_REQUEST, and it's evil.
> That's not the intent of the error code logic at all.  If you want to
> change that, I'm not necessarily opposed to it, but that's something that
> should be discussed in an RFC.  What the current code does is not
> consistent with the original intent.
> 
> I don't agree that you shouldn't propagate error codes through OMPI; in
> fact, the original intent of the design was to allow such propagation.
> Again, such a change should be discussed as part of an RFC.
> 
> Brian
> 
> On 10/19/11 4:50 PM, "George Bosilca"  wrote:
> 
>> I don't know how you think that the error codes work in Open MPI, so I'll
>> take the liberty to depict it here so we all agree we're talking about
>> the same thing.
>> 
>> The opal_strerror is a nice feature, it allow to register a range of
>> error codes with a particular error converter. Every time you look for
>> the meaning of a particular error code, the first convertor with a range
>> enveloping the looked at value, will translate it into an error string.
>> 
>> This is only currently used by OPAL and ORTE directly. It worked at the
>> OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>> ones. This behavior didn't change after my patch, you can still use
>> opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>> 
>> There is a small "variation" for OMPI_ERR_REQUEST, the only really OMPI
>> specific error code today. The OMPI error codes are actually inserted
>> between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>> there is __no__ possible overlap right now. If at one point we add more
>> than 100 OMPI level, we should certainly revisit this.
>> 
>> Now, resulting from my patch, there is a difference. One should not
>> simply forward an ORTE code into the stack of OMPI, and hope it just
>> works. Errors should be dealt with where they happens, and if not
>> possible they should be translated into the actual layer error code. The
>> error propagation should be compartmentalized, and has to be translated
>> into an error code that has a meaning at the OMPI level. The current
>> patch should not prevent the mixed error-code code to work, as
>> opal_strerror retains the same behavior as before. However, this coding
>> practice should be avoided. I tried to clean the current code of such
>> instances few days ago in r25230.
>> 
>> Moreover, this is similar to how we deal with the error codes between
>> OMPI and MPI layers, and seems like a sane way to compose libraries. You
>> deal with a specific layer error code when you get it (usually after the
>> call to a function from that specific layer), not later on when you don't
>> even know exactly what the execution path was.
>> 
>> george.
>> 
>> PS: I'll fix the +/- issue.
>> 
>> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>> 
>>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>>> codes. That seems like a very bad idea (in addition to the mixing of +
>>> and -).
>>> 
>>> For one thing, that breaks opal_strerror().  That, in itself, seems
>>> like a dealbreaker.
>>> 
>>> 
>>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>>> 
>>>> I actually think it's worse than that.  An ORTE error code can now have
>>>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>>> Or, they should, if George hadn't made a mistake (see below).  The
>>>> sharing of return codes seems... bad.
>>>> 
>>>> Also, there's a bug in George's patch.  Error codes are all negative,
>>>> so OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>>>> OMPI_ERR_BASE - 1, not plus 2.
>>>> 
>>>> Brian
>>>> 
>>>> On 10/19/11 1:32 PM, "Ralph Castain" wrote:
>>>> 
>>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>>> the right answer. So please consider this a general design question
>>>>> for the community.
>>>>> 
>>>>> This commit removes all the 

[OMPI devel] RFC: upgrade to libevent 2.0.13 (removing 2.0.7)

2011-10-19 Thread Nathan Hjelm

WHAT: upgrade to libevent 2.0.13

WHY: libevent bug fixes

WHEN: Nov 2, 2011

TIMEOUT: 2 weeks

***
Jeff, Ralph, and I have been using the libevent2013 component for the last 
month without issue. In 2 weeks I will:
 - remove opal/mca/event/libevent207
 - remove opal/mca/event/libevent2013/.ompi_ignore
 - remove opal/mca/event/libevent2013/.ompi_unignore

-Nathan


Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Barrett, Brian W
George -

I wrote the error code gorp; I'm pretty sure I know exactly how it was
supposed to work.

There are 58 codes unused between OPAL_NETWORK_NOT_PARSEABLE and
OPAL_ERR_MAX.  I now see what you did with ERR_REQUEST, and it's evil.
That's not the intent of the error code logic at all.  If you want to
change that, I'm not necessarily opposed to it, but that's something that
should be discussed in an RFC.  What the current code does is not
consistent with the original intent.

I don't agree that you shouldn't propagate error codes through OMPI; in
fact, the original intent of the design was to allow such propagation.
Again, such a change should be discussed as part of an RFC.

Brian

On 10/19/11 4:50 PM, "George Bosilca"  wrote:

>I don't know how you think that the error codes work in Open MPI, so I'll
>take the liberty to depict it here so we all agree we're talking about
>the same thing.
>
>The opal_strerror is a nice feature, it allow to register a range of
>error codes with a particular error converter. Every time you look for
>the meaning of a particular error code, the first convertor with a range
>enveloping the looked at value, will translate it into an error string.
>
>This is only currently used by OPAL and ORTE directly. It worked at the
>OMPI level only because we mapped __all__ OMPI errors to OPAL or ORTE
>ones. This behavior didn't change after my patch, you can still use
>opal_strerror to get the error string for all OPAL/ORTE/OMPI errors.
>
>There is a small "variation" for OMPI_ERR_REQUEST, the only really OMPI
>specific error code today. The OMPI error codes are actually inserted
>between the OPAL and the ORTE ones (there is a gap of 100 elements), so
>there is __no__ possible overlap right now. If at one point we add more
>than 100 OMPI level, we should certainly revisit this.
>
>Now, resulting from my patch, there is a difference. One should not
>simply forward an ORTE code into the stack of OMPI, and hope it just
>works. Errors should be dealt with where they happens, and if not
>possible they should be translated into the actual layer error code. The
>error propagation should be compartmentalized, and has to be translated
>into an error code that has a meaning at the OMPI level. The current
>patch should not prevent the mixed error-code code to work, as
>opal_strerror retains the same behavior as before. However, this coding
>practice should be avoided. I tried to clean the current code of such
>instances few days ago in r25230.
>
>Moreover, this is similar to how we deal with the error codes between
>OMPI and MPI layers, and seems like a sane way to compose libraries. You
>deal with a specific layer error code when you get it (usually after the
>call to a function from that specific layer), not later on when you don't
>even know exactly what the execution path was.
>
>  george.
>
>PS: I'll fix the +/- issue.
>
>On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
>
>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error
>> codes. That seems like a very bad idea (in addition to the mixing of +
>> and -).
>> 
>> For one thing, that breaks opal_strerror().  That, in itself, seems
>> like a dealbreaker.
>> 
>> 
>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>> 
>>> I actually think it's worse than that.  An ORTE error code can now have
>>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>> Or, they should, if George hadn't made a mistake (see below).  The
>>> sharing of return codes seems... bad.
>>> 
>>> Also, there's a bug in George's patch.  Error codes are all negative,
>>> so OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>>> OMPI_ERR_BASE - 1, not plus 2.
>>> 
>>> Brian
>>> 
>>> On 10/19/11 1:32 PM, "Ralph Castain" wrote:
>>> 
>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>> the right answer. So please consider this a general design question
>>>> for the community.
>>>> 
>>>> This commit removes all the OMPI <-> ORTE equivalent constants -
>>>> i.e., we used to declare OMPI-prefixed equivalents to every
>>>> ORTE-prefixed constant. I understand the thinking (or at least, what
>>>> I suspect was the thought), but it creates an issue.
>>>> 
>>>> Suppose I have an ompi-level function (A) that calls another
>>>> ompi-level function (B). Invisible to A is that B calls an orte-level
>>>> function. B dutifully checks the error return from the orte-level
>>>> function against an ORTE-prefixed constant.
>>>> 
>>>> However, if that return isn't "success", what does B return up to A?
>>>> It cannot return the OMPI equivalent to the orte error constant
>>>> because it no longer exists. It could return the orte error code, but
>>>> A has no way of knowing it is going to get a non-OMPI constant, and
>>>> therefore won't be able to understand it - 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Ralph Castain

On Oct 19, 2011, at 2:50 PM, George Bosilca wrote:

> I don't know how you think that the error codes work in Open MPI, so I'll 
> take the liberty to depict it here so we all agree we're talking about the 
> same thing.
> 
> opal_strerror is a nice feature: it allows a range of error codes to be 
> registered with a particular error converter. Every time you look up the 
> meaning of an error code, the first converter whose range encloses that 
> value translates it into an error string.
> 
> This is currently used directly only by OPAL and ORTE. It worked at the OMPI 
> level only because we mapped __all__ OMPI errors to OPAL or ORTE ones. This 
> behavior didn't change after my patch; you can still use opal_strerror to 
> get the error string for all OPAL/ORTE/OMPI errors.
> 
> There is a small "variation" for OMPI_ERR_REQUEST, the only truly 
> OMPI-specific error code today. The OMPI error codes are actually inserted 
> between the OPAL and the ORTE ones (there is a gap of 100 elements), so 
> there is __no__ possible overlap right now. If at some point we add more 
> than 100 OMPI-level codes, we should certainly revisit this.
> 
> Now, as a result of my patch, there is a difference. One should not simply 
> forward an ORTE code up the OMPI stack and hope it just works. Errors 
> should be dealt with where they happen, and if that is not possible they 
> should be translated into an error code of the current layer. Error 
> propagation should be compartmentalized: an error has to be translated into 
> a code that has a meaning at the OMPI level. The current patch should not 
> prevent code that mixes error codes from working, as opal_strerror retains 
> the same behavior as before. However, this coding practice should be 
> avoided. I tried to clean the current code of such instances a few days ago 
> in r25230.
> 
> Moreover, this is similar to how we deal with the error codes between OMPI 
> and MPI layers, and seems like a sane way to compose libraries. You deal with 
> a specific layer error code when you get it (usually after the call to a 
> function from that specific layer), not later on when you don't even know 
> exactly what the execution path was.


I'll have to ponder your logic. Not saying I disagree, but it would have been 
much nicer if you had explained your intended purpose in an RFC before imposing 
such a philosophy.

We were passing error codes up the ladder to allow higher levels to take 
corrective action that might extend beyond the scope of the immediate code - 
e.g., it might lead someone to use a specific different component in the 
framework if they knew that the RML was no longer working. We have lost that 
ability now, though we can regain it by defining OMPI error codes that don't 
equate to ORTE values, but retain the same meaning - and then translating as 
required. Not sure what that buys us, but maybe it will make some people feel 
better.
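
To illustrate what I mean by translating while retaining meaning, here is a 
small sketch. All constant names and values are hypothetical placeholders, 
not the real ORTE or OMPI codes; the point is only that the upper layer gets 
its own code for each condition it may want to act on, with a generic error 
as the fallback.

```c
#include <assert.h>

/* Hypothetical lower-layer (runtime) codes. */
enum {
    SKETCH_LOWER_SUCCESS          = 0,
    SKETCH_LOWER_ERR_COMM_FAILURE = -201,
    SKETCH_LOWER_ERR_TIMEOUT      = -202
};

/* Hypothetical upper-layer codes: different values, same meanings. */
enum {
    SKETCH_UPPER_SUCCESS          = 0,
    SKETCH_UPPER_ERROR            = -1,    /* generic fallback */
    SKETCH_UPPER_ERR_COMM_FAILURE = -101,
    SKETCH_UPPER_ERR_TIMEOUT      = -102
};

/* Translate at the layer boundary so callers above only ever see
 * upper-layer codes, yet can still tell specific failures apart. */
int sketch_upper_from_lower(int lower_rc)
{
    assert(lower_rc <= 0);  /* codes in this sketch are zero or negative */
    switch (lower_rc) {
    case SKETCH_LOWER_SUCCESS:          return SKETCH_UPPER_SUCCESS;
    case SKETCH_LOWER_ERR_COMM_FAILURE: return SKETCH_UPPER_ERR_COMM_FAILURE;
    case SKETCH_LOWER_ERR_TIMEOUT:      return SKETCH_UPPER_ERR_TIMEOUT;
    default:                            return SKETCH_UPPER_ERROR;
    }
}
```

Anything without a dedicated mapping collapses to the generic error, which 
is exactly the loss of information I am worried about if the table is thin.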


> 
>  george.
> 
> PS: I'll fix the +/- issue.
> 
> On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:
> 
>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error 
>> codes. That seems like a very bad idea (in addition to the mixing of + and 
>> -).
>> 
>> For one thing, that breaks opal_strerror().  That, in itself, seems like a 
>> dealbreaker.
>> 
>> 
>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>> 
>>> I actually think it's worse than that.  An ORTE error code can now have
>>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>> Or, they should, if George hadn't made a mistake (see below).  The sharing
>>> of return codes seems... bad.
>>> 
>>> Also, there's a bug in George's patch.  Error codes are all negative, so
>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>>> OMPI_ERR_BASE - 1, not plus 2.
>>> 
>>> Brian
>>> 
>>> On 10/19/11 1:32 PM, "Ralph Castain"  wrote:
>>> 
>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>> the right answer. So please consider this a general design question for
>>>> the community.
>>>> 
>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>> constant. I understand the thinking (or at least, what I suspect was the
>>>> thought), but it creates an issue.
>>>> 
>>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>>> function (B). Invisible to A is that B calls an orte-level function. B
>>>> dutifully checks the error return from the orte-level function against an
>>>> ORTE-prefixed constant.
>>>> 
>>>> However, if that return isn't "success", what does B return up to A? It
>>>> cannot return the OMPI equivalent to the orte error constant because it
>>>> no longer exists. It could return the orte error code, but A has no way
>>>> of knowing it is going to get a non-OMPI constant, and 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread George Bosilca
Can I have an example on how the current trunk is broken due to this change?

Thanks,
  george.

On Oct 19, 2011, at 16:32 , Ralph Castain wrote:

> I propose that we retain the rest of the changeset, but revert the OMPI 
> constants to bring back their ORTE equivalents. We clearly should scrub those 
> and update them to ensure they are both used and current, but it seems to me 
> we lose more than we gain by removing them.
> 
> 
> On Oct 19, 2011, at 12:09 PM, Jeff Squyres wrote:
> 
>> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error 
>> codes. That seems like a very bad idea (in addition to the mixing of + and 
>> -).
>> 
>> For one thing, that breaks opal_strerror().  That, in itself, seems like a 
>> dealbreaker.
>> 
>> 
>> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
>> 
>>> I actually think it's worse than that.  An ORTE error code can now have
>>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>>> Or, they should, if George hadn't made a mistake (see below).  The sharing
>>> of return codes seems... bad.
>>> 
>>> Also, there's a bug in George's patch.  Error codes are all negative, so
>>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>>> OMPI_ERR_BASE - 1, not plus 2.
>>> 
>>> Brian
>>> 
>>> On 10/19/11 1:32 PM, "Ralph Castain"  wrote:
>>> 
>>>> I've been wrestling with something from this commit, and I'm unsure of
>>>> the right answer. So please consider this a general design question for
>>>> the community.
>>>> 
>>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>>> constant. I understand the thinking (or at least, what I suspect was the
>>>> thought), but it creates an issue.
>>>> 
>>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>>> function (B). Invisible to A is that B calls an orte-level function. B
>>>> dutifully checks the error return from the orte-level function against an
>>>> ORTE-prefixed constant.
>>>> 
>>>> However, if that return isn't "success", what does B return up to A? It
>>>> cannot return the OMPI equivalent to the orte error constant because it
>>>> no longer exists. It could return the orte error code, but A has no way
>>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>>> able to understand it - it will be an "unrecognized error".
>>>> 
>>>> I guess one option is to require that B "translate" the return code and
>>>> pass some OMPI error up the chain, but this prevents anything upwards
>>>> from understanding the nature of the problem and potentially taking
>>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>>> the time the only option will be the vanilla "OMPI_ERROR".
>>>> 
>>>> Thoughts?
>>> -- 
>>> Brian W. Barrett
>>> Dept. 1423: Scalable System Software
>>> Sandia National Laboratories
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 




Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread George Bosilca
I don't know how you think that the error codes work in Open MPI, so I'll take 
the liberty to depict it here so we all agree we're talking about the same 
thing.

opal_strerror is a nice feature: it allows a range of error codes to be 
registered with a particular error converter. Every time you look up the 
meaning of an error code, the first converter whose range encloses that value 
translates it into an error string.

This is currently used directly only by OPAL and ORTE. It worked at the OMPI 
level only because we mapped __all__ OMPI errors to OPAL or ORTE ones. This 
behavior didn't change after my patch; you can still use opal_strerror to get 
the error string for all OPAL/ORTE/OMPI errors.

There is a small "variation" for OMPI_ERR_REQUEST, the only truly 
OMPI-specific error code today. The OMPI error codes are actually inserted 
between the OPAL and the ORTE ones (there is a gap of 100 elements), so there 
is __no__ possible overlap right now. If at some point we add more than 100 
OMPI-level error codes, we should certainly revisit this.

Now, as a result of my patch, there is one difference. One should not simply 
forward an ORTE code up the OMPI stack and hope it just works. Errors 
should be dealt with where they happen, and if that is not possible they 
should be translated into an error code of the current layer. Error 
propagation should be compartmentalized: an error has to be translated into a 
code that has a meaning at the OMPI level. The current patch should not 
prevent code that mixes error codes from working, as opal_strerror retains the 
same behavior as before. However, this coding practice should be avoided. I 
tried to clean the current code of such instances a few days ago in r25230.

Moreover, this is similar to how we deal with the error codes between OMPI and 
MPI layers, and seems like a sane way to compose libraries. You deal with a 
specific layer error code when you get it (usually after the call to a function 
from that specific layer), not later on when you don't even know exactly what 
the execution path was.
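
As an illustration of translating at the layer boundary, here is a small sketch. All constants and functions in it (LOWER_*, UPPER_*, lower_send, upper_b) are hypothetical stand-ins, not Open MPI symbols; the point is only that B inspects the lower-layer code right after the call and hands its own layer's code up to A.

```c
/* Hypothetical lower-layer codes (stand-ins for ORTE-style constants). */
enum {
    LOWER_SUCCESS             = 0,
    LOWER_ERR_OUT_OF_RESOURCE = -202,
    LOWER_ERR_TIMEOUT         = -215,
    LOWER_ERR_COMM_FAILURE    = -230
};

/* Hypothetical upper-layer codes (stand-ins for OMPI-style constants). */
enum {
    UPPER_SUCCESS             = 0,
    UPPER_ERROR               = -1,
    UPPER_ERR_OUT_OF_RESOURCE = -2,
    UPPER_ERR_TIMEOUT         = -15
};

/* Map a lower-layer code onto the closest upper-layer meaning. */
static int upper_from_lower(int rc)
{
    switch (rc) {
    case LOWER_SUCCESS:             return UPPER_SUCCESS;
    case LOWER_ERR_OUT_OF_RESOURCE: return UPPER_ERR_OUT_OF_RESOURCE;
    case LOWER_ERR_TIMEOUT:         return UPPER_ERR_TIMEOUT;
    default:                        return UPPER_ERROR; /* no closer match */
    }
}

/* Stand-in for a lower-layer call; it just returns the code it is given. */
static int lower_send(int simulated_rc)
{
    return simulated_rc;
}

/* "B": deals with the lower-layer code right after the call and only
 * ever returns upper-layer codes to its caller "A". */
int upper_b(int simulated_rc)
{
    int rc = lower_send(simulated_rc);
    if (LOWER_SUCCESS != rc) {
        return upper_from_lower(rc);
    }
    return UPPER_SUCCESS;
}
```

The default branch is where the concern about losing detail shows up: anything without an upper-layer equivalent collapses into the generic code.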

  george.

PS: I'll fix the +/- issue.

On Oct 19, 2011, at 14:09 , Jeff Squyres wrote:

> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error codes. 
> That seems like a very bad idea (in addition to the mixing of + and -).
> 
> For one thing, that breaks opal_strerror().  That, in itself, seems like a 
> dealbreaker.
> 
> 
> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
> 
>> I actually think it's worse than that.  An ORTE error code can now have
>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>> Or, they should, if George hadn't made a mistake (see below).  The sharing
>> of return codes seems... bad.
>> 
>> Also, there's a bug in George's patch.  Error codes are all negative, so
>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>> OMPI_ERR_BASE - 1, not plus 2.
>> 
>> Brian
>> 
>> On 10/19/11 1:32 PM, "Ralph Castain"  wrote:
>> 
>>> I've been wrestling with something from this commit, and I'm unsure of
>>> the right answer. So please consider this a general design question for
>>> the community.
>>> 
>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>> constant. I understand the thinking (or at least, what I suspect was the
>>> thought), but it creates an issue.
>>> 
>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>> function (B). Invisible to A is that B calls an orte-level function. B
>>> dutifully checks the error return from the orte-level function against an
>>> ORTE-prefixed constant.
>>> 
>>> However, if that return isn't "success", what does B return up to A? It
>>> cannot return the OMPI equivalent to the orte error constant because it
>>> no longer exists. It could return the orte error code, but A has no way
>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>> able to understand it - it will be an "unrecognized error".
>>> 
>>> I guess one option is to require that B "translate" the return code and
>>> pass some OMPI error up the chain, but this prevents anything upwards
>>> from understanding the nature of the problem and potentially taking
>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>> the time the only option will be the vanilla "OMPI_ERROR".
>>> 
>>> Thoughts?
>> -- 
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Ralph Castain
I propose that we retain the rest of the changeset, but revert the OMPI 
constants to bring back their ORTE equivalents. We clearly should scrub those 
and update them to ensure they are both used and current, but it seems to me we 
lose more than we gain by removing them.


On Oct 19, 2011, at 12:09 PM, Jeff Squyres wrote:

> Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error codes. 
> That seems like a very bad idea (in addition to the mixing of + and -).
> 
> For one thing, that breaks opal_strerror().  That, in itself, seems like a 
> dealbreaker.
> 
> 
> On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:
> 
>> I actually think it's worse than that.  An ORTE error code can now have
>> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
>> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
>> Or, they should, if George hadn't made a mistake (see below).  The sharing
>> of return codes seems... bad.
>> 
>> Also, there's a bug in George's patch.  Error codes are all negative, so
>> OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
>> OMPI_ERR_BASE - 1, not plus 2.
>> 
>> Brian
>> 
>> On 10/19/11 1:32 PM, "Ralph Castain"  wrote:
>> 
>>> I've been wrestling with something from this commit, and I'm unsure of
>>> the right answer. So please consider this a general design question for
>>> the community.
>>> 
>>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>>> constant. I understand the thinking (or at least, what I suspect was the
>>> thought), but it creates an issue.
>>> 
>>> Suppose I have an ompi-level function (A) that calls another ompi-level
>>> function (B). Invisible to A is that B calls an orte-level function. B
>>> dutifully checks the error return from the orte-level function against an
>>> ORTE-prefixed constant.
>>> 
>>> However, if that return isn't "success", what does B return up to A? It
>>> cannot return the OMPI equivalent to the orte error constant because it
>>> no longer exists. It could return the orte error code, but A has no way
>>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>>> able to understand it - it will be an "unrecognized error".
>>> 
>>> I guess one option is to require that B "translate" the return code and
>>> pass some OMPI error up the chain, but this prevents anything upwards
>>> from understanding the nature of the problem and potentially taking
>>> corrective and/or alternative action. Seems awfully limiting, as most of
>>> the time the only option will be the vanilla "OMPI_ERROR".
>>> 
>>> Thoughts?
>> -- 
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 




Re: [OMPI devel] Locality info

2011-10-19 Thread Ralph Castain
Sorry - referenced the wrong commit. It was r25331


On Oct 19, 2011, at 2:28 PM, Ralph Castain wrote:

> Hi folks
> 
> For those of you who don't follow the commits...
> 
> I just committed (r25323) an extension of the orte_ess.proc_get_locality 
> function that allows a process to get its relative resource usage with any 
> other proc in the job. In other words, you can provide a process name to the 
> function, and the returned bitmask tells you if you share a node, numa, 
> socket, caches (by level), core, and hyperthread with that process.
> 
> If you are on the same node and unbound, of course, you share all of those. 
> However, if you are bound, then this can help tell you if you are on a common 
> numa node, sharing an L1 cache, etc. Might be handy.
> 
> I implemented the underlying functionality so that we can further extend it 
> to tell you the relative resource location of two procs on a remote node. If 
> that someday becomes of interest, it would be relatively easy to do - but 
> would require passing more info around. Hence, I've allowed for it, but not 
> implemented it until there is some identified need.
> 
> Locality info is available anytime after the modex is completed during 
> MPI_Init, and is supported regardless of launch environment (minus cnos, for 
> now), launch by mpirun, or direct-launch - in other words, pretty much always.
> 
> Hope it proves of help in your work
> Ralph
> 




[OMPI devel] Locality info

2011-10-19 Thread Ralph Castain
Hi folks

For those of you who don't follow the commits...

I just committed (r25323) an extension of the orte_ess.proc_get_locality 
function that allows a process to get its relative resource usage with any 
other proc in the job. In other words, you can provide a process name to the 
function, and the returned bitmask tells you if you share a node, numa, socket, 
caches (by level), core, and hyperthread with that process.

If you are on the same node and unbound, of course, you share all of those. 
However, if you are bound, then this can help tell you if you are on a common 
numa node, sharing an L1 cache, etc. Might be handy.
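
For anyone who wants to see how such a bitmask might be consumed, here is a rough sketch. The flag names and values (DEMO_ON_*) and the helper are invented for illustration and are not the real definitions; only the idea that one bit is set per shared resource level comes from the description above.

```c
#include <stdint.h>

/* Hypothetical locality bitmask: one bit per shared resource level. */
typedef uint16_t demo_locality_t;

#define DEMO_ON_NODE     0x0001
#define DEMO_ON_NUMA     0x0002
#define DEMO_ON_SOCKET   0x0004
#define DEMO_ON_L3CACHE  0x0008
#define DEMO_ON_L2CACHE  0x0010
#define DEMO_ON_L1CACHE  0x0020
#define DEMO_ON_CORE     0x0040
#define DEMO_ON_HWTHREAD 0x0080

/* Name the most specific level that the two procs share. */
const char *demo_closest_shared(demo_locality_t loc)
{
    if (loc & DEMO_ON_HWTHREAD) return "hwthread";
    if (loc & DEMO_ON_CORE)     return "core";
    if (loc & DEMO_ON_L1CACHE)  return "L1 cache";
    if (loc & DEMO_ON_L2CACHE)  return "L2 cache";
    if (loc & DEMO_ON_L3CACHE)  return "L3 cache";
    if (loc & DEMO_ON_SOCKET)   return "socket";
    if (loc & DEMO_ON_NUMA)     return "numa";
    if (loc & DEMO_ON_NODE)     return "node";
    return "none";
}
```

Two procs bound to different cores of the same socket would, under these made-up flags, report node, numa and socket bits set, and the helper would answer "socket".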

I implemented the underlying functionality so that we can further extend it to 
tell you the relative resource location of two procs on a remote node. If that 
someday becomes of interest, it would be relatively easy to do - but would 
require passing more info around. Hence, I've allowed for it, but not 
implemented it until there is some identified need.

Locality info is available anytime after the modex is completed during 
MPI_Init, and is supported regardless of launch environment (minus cnos, for 
now), launch by mpirun, or direct-launch - in other words, pretty much always.

Hope it proves of help in your work
Ralph




Re: [OMPI devel] make check fails for Intel 2011.6.233 (OpenMPI 1.4.3)

2011-10-19 Thread Larry Baker
I posted my findings about the bad version no. macros to the same  
thread that described the Intel V12.1 optimizer bug  
(http://software.intel.com/en-us/forums/showthread.php?t=87132).  
The response I got is:



Posted By: Hubert Haberstock (Intel)
__

The build date is currently the only suitable macro. This allows to  
check for the Intel Compiler and for specific compiler versions.  
Makes sense? Regards, Hubert.

__


That is contrary to what the online V12.1 documentation says.  I'm  
going to find out what the previous versions do, then report this  
through my normal support channels.  If the documentation is wrong,  
they should fix it; if the documentation is right, they should fix the  
compiler.  (However, there will still be an errant V12.1.0 that  
reports itself as , so use of the version no. macros will never be  
reliable without a hack to handle this errant case.)  I'll report here  
what I find about the values of the version no. macros.  It is  
probably better, though, that automake/libtool rely on the output of  
icc -v, since that seems to always result in a value that matches the  
version of the product (as opposed to #define __INTEL_COMPILER   
and #define __ICC  from within the V12.1.0 compiler).
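
If the build date really is the only usable macro, a guard could look roughly like the sketch below. The function name is made up, and the assumption that __INTEL_COMPILER_BUILD_DATE always expands to a YYYYMMDD integer rests only on the 20110811 value reported earlier in this thread. The sketch deliberately never evaluates __INTEL_COMPILER or __ICC, since their values are exactly what is in question here.

```c
/* Returns 1 when compiled by an Intel compiler whose build date is at
 * least min_date (YYYYMMDD), 0 otherwise, including on other compilers.
 * Hypothetical helper; it tests only the build-date macro. */
int demo_intel_build_at_least(long min_date)
{
#if defined(__INTEL_COMPILER_BUILD_DATE)
    return ((long) __INTEL_COMPILER_BUILD_DATE >= min_date) ? 1 : 0;
#else
    (void) min_date;   /* not an Intel compiler: nothing to compare */
    return 0;
#endif
}
```

The same #if defined(...) test could wrap a #pragma directly; it is shown as a function here only so that it can be compiled and checked on any compiler.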


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On 19 Oct 2011, at 10:47 AM, Jeff Squyres wrote:


Did this get reported to the Intel compiler support people?


On Oct 19, 2011, at 8:24 AM, George Bosilca wrote:


Thanks Larry,

Will forward this info upstream.

 george.

On Oct 18, 2011, at 21:56 , Larry Baker wrote:


George,

Thanks for the update.  FYI, here's all the version numbers  
reported by the compiler releases I have installed:



[baker@hydra ~]$ module load compilers/intel/11.1.080
[baker@hydra ~]$ icc -v
Version 11.1
[baker@hydra ~]$ module unload compilers/intel/11.1.080



[baker@hydra ~]$ module load compilers/intel/2011.3.174
[baker@hydra ~]$ icc -v
Version 12.0.3
[baker@hydra ~]$ module unload compilers/intel/2011.3.174



[baker@hydra ~]$ module load compilers/intel/2011.4.191
[baker@hydra ~]$ icc -v
Version 12.0.4
[baker@hydra ~]$ module unload compilers/intel/2011.4.191



[baker@hydra ~]$ module load compilers/intel/2011.5.220
[baker@hydra ~]$ icc -v
Version 12.0.5
[baker@hydra ~]$ module unload compilers/intel/2011.5.220



[baker@hydra ~]$ module load compilers/intel/2011.6.233
[baker@hydra ~]$ icc -v
icc version 12.1.0 (gcc version 4.1.2 compatibility)
[baker@hydra ~]$ module unload compilers/intel/2011.6.233


Another problem I found with the Intel 12.1.0 compiler: I started  
to look at adding a test for the Intel compiler version around the  
#pragma that disables optimization for OpenMPI and I found the  
__ICC and __INTEL_COMPILER predefined macros (compiler version  
no.) are not properly defined:


$ icc -E -dD hello.c | grep __INTEL_COMPILER
#define __INTEL_COMPILER 
#define __INTEL_COMPILER_BUILD_DATE 20110811

$ icc -E -dD hello.c | grep __ICC
#define __ICC 

$ icc -v
icc version 12.1.0 (gcc version 4.1.2 compatibility)

I do not know if there is code in OpenMPI that looks at __ICC and  
__INTEL_COMPILER, but that could cause problems.  (Pass this on  
upstream to the libtool people?)


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On 17 Oct 2011, at 8:18 PM, George Bosilca wrote:


Larry,

Sorry for not updating this thread. The issue was identified and  
fixed by Rainer in r25290  
(https://svn.open-mpi.org/trac/ompi/changeset/25290). Please read the  
comments and the linked thread on the Intel forum for more info.


I couldn't find a trace of this being fixed in the 1.4 series, so  
I would hold off on upgrading until this issue gets resolved.


 Thanks,
   george.

On Oct 17, 2011, at 23:00 , Larry Baker wrote:


George,

I have not had time to look over the 1.4.3 make check failure  
for Intel 2011.6.233 compilers.  Have you?


I had planned to get 1.4.3 compiled on all six of our compilers  
using the latest compiler releases.  I was putting off upgrading  
to 1.4.4 or 1.5.x until after that to minimize the number of  
things that could go wrong.  Do you recommend otherwise?


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On 7 Oct 2011, at 6:46 PM, George Bosilca wrote:

The may_alias attribute was part of forward-looking attribute  
checking, at a time when few compilers supported such attributes.  
This explains why they are not widely used in the library itself.  
Moreover, as they do not affect the compilation itself (as your  
test highlights, this is not the issue with the icc 2011.6.233  
compiler), there is no urgency to remove the may_alias support.


I just got that particular version of the compiler installed on  
one of our machines. I'll give it a try over the weekend.


 george.

On Oct 7, 2011, at 20:21 , Larry Baker wrote:

The test for the __may_alias__ attribute uses the following  
short code 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Jeff Squyres
Oy, yes, that is bad -- we cannot have overlapping ORTE and OMPI error codes. 
That seems like a very bad idea (in addition to the mixing of + and -).

For one thing, that breaks opal_strerror().  That, in itself, seems like a 
dealbreaker.


On Oct 19, 2011, at 1:51 PM, Barrett, Brian W wrote:

> I actually think it's worse than that.  An ORTE error code can now have
> the same error code as an OMPI error.  OMPI_ERR_REQUEST and
> ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
> Or, they should, if George hadn't made a mistake (see below).  The sharing
> of return codes seems... bad.
> 
> Also, there's a bug in George's patch.  Error codes are all negative, so
> OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
> OMPI_ERR_BASE - 1, not plus 2.
> 
> Brian
> 
> On 10/19/11 1:32 PM, "Ralph Castain"  wrote:
> 
>> I've been wrestling with something from this commit, and I'm unsure of
>> the right answer. So please consider this a general design question for
>> the community.
>> 
>> This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>> used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>> constant. I understand the thinking (or at least, what I suspect was the
>> thought), but it creates an issue.
>> 
>> Suppose I have an ompi-level function (A) that calls another ompi-level
>> function (B). Invisible to A is that B calls an orte-level function. B
>> dutifully checks the error return from the orte-level function against an
>> ORTE-prefixed constant.
>> 
>> However, if that return isn't "success", what does B return up to A? It
>> cannot return the OMPI equivalent to the orte error constant because it
>> no longer exists. It could return the orte error code, but A has no way
>> of knowing it is going to get a non-OMPI constant, and therefore won't be
>> able to understand it - it will be an "unrecognized error".
>> 
>> I guess one option is to require that B "translate" the return code and
>> pass some OMPI error up the chain, but this prevents anything upwards
>> from understanding the nature of the problem and potentially taking
>> corrective and/or alternative action. Seems awfully limiting, as most of
>> the time the only option will be the vanilla "OMPI_ERROR".
>> 
>> Thoughts?
> -- 
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Barrett, Brian W
I actually think it's worse than that.  An ORTE error code can now have
the same error code as an OMPI error.  OMPI_ERR_REQUEST and
ORTE_ERR_RECV_LESS_THAN_POSTED now share the same integer return code.
Or, they should, if George hadn't made a mistake (see below).  The sharing
of return codes seems... bad.

Also, there's a bug in George's patch.  Error codes are all negative, so
OMPI_ERR_REQUEST should be OMPI_ERR_BASE -1 and OMPI_ERR_MAX should be
OMPI_ERR_BASE - 1, not plus 2.
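
To show why the sign matters, here is a tiny sketch with made-up numbers (DEMO_ERR_BASE of -300 and a block of 100 codes are assumptions, not the real values). With negative codes, a layer's range grows downward from its base, so adding a positive offset lands outside the layer's own block.

```c
/* Made-up layout: this layer owns the codes c with
 * DEMO_ERR_MAX < c <= DEMO_ERR_BASE, growing toward more negative values. */
#define DEMO_ERR_BASE (-300)
#define DEMO_ERR_MAX  (DEMO_ERR_BASE - 100)

#define DEMO_ERR_GOOD (DEMO_ERR_BASE - 1)   /* -301: inside the block */
#define DEMO_ERR_BAD  (DEMO_ERR_BASE + 2)   /* -298: outside the block */

/* Returns 1 when code falls inside this layer's block, 0 otherwise. */
int demo_owns(int code)
{
    return (code <= DEMO_ERR_BASE && code > DEMO_ERR_MAX) ? 1 : 0;
}
```

Under this layout, demo_owns(DEMO_ERR_GOOD) is 1 and demo_owns(DEMO_ERR_BAD) is 0; whichever layer sits just above the base would see the "plus" code as one of its own.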

Brian

On 10/19/11 1:32 PM, "Ralph Castain"  wrote:

>I've been wrestling with something from this commit, and I'm unsure of
>the right answer. So please consider this a general design question for
>the community.
>
>This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we
>used to declare OMPI-prefixed equivalents to every ORTE-prefixed
>constant. I understand the thinking (or at least, what I suspect was the
>thought), but it creates an issue.
>
>Suppose I have an ompi-level function (A) that calls another ompi-level
>function (B). Invisible to A is that B calls an orte-level function. B
>dutifully checks the error return from the orte-level function against an
>ORTE-prefixed constant.
>
>However, if that return isn't "success", what does B return up to A? It
>cannot return the OMPI equivalent to the orte error constant because it
>no longer exists. It could return the orte error code, but A has no way
>of knowing it is going to get a non-OMPI constant, and therefore won't be
>able to understand it - it will be an "unrecognized error".
>
>I guess one option is to require that B "translate" the return code and
>pass some OMPI error up the chain, but this prevents anything upwards
>from understanding the nature of the problem and potentially taking
>corrective and/or alternative action. Seems awfully limiting, as most of
>the time the only option will be the vanilla "OMPI_ERROR".
>
>Thoughts?
-- 
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories








Re: [OMPI devel] make check fails for Intel 2011.6.233 (OpenMPI 1.4.3)

2011-10-19 Thread Jeff Squyres
Did this get reported to the Intel compiler support people?


On Oct 19, 2011, at 8:24 AM, George Bosilca wrote:

> Thanks Larry,
> 
> Will forward this info upstream.
> 
>   george.
> 
> On Oct 18, 2011, at 21:56 , Larry Baker wrote:
> 
>> George,
>> 
>> Thanks for the update.  FYI, here's all the version numbers reported by the 
>> compiler releases I have installed:
>> 
>>> [baker@hydra ~]$ module load compilers/intel/11.1.080
>>> [baker@hydra ~]$ icc -v
>>> Version 11.1 
>>> [baker@hydra ~]$ module unload compilers/intel/11.1.080
>> 
>>> [baker@hydra ~]$ module load compilers/intel/2011.3.174
>>> [baker@hydra ~]$ icc -v
>>> Version 12.0.3
>>> [baker@hydra ~]$ module unload compilers/intel/2011.3.174
>> 
>>> [baker@hydra ~]$ module load compilers/intel/2011.4.191
>>> [baker@hydra ~]$ icc -v
>>> Version 12.0.4
>>> [baker@hydra ~]$ module unload compilers/intel/2011.4.191
>> 
>>> [baker@hydra ~]$ module load compilers/intel/2011.5.220
>>> [baker@hydra ~]$ icc -v
>>> Version 12.0.5
>>> [baker@hydra ~]$ module unload compilers/intel/2011.5.220
>> 
>>> [baker@hydra ~]$ module load compilers/intel/2011.6.233
>>> [baker@hydra ~]$ icc -v
>>> icc version 12.1.0 (gcc version 4.1.2 compatibility)
>>> [baker@hydra ~]$ module unload compilers/intel/2011.6.233
>> 
>> Another problem I found with the Intel 12.1.0 compiler: I started to look at 
>> adding a test for the Intel compiler version around the #pragma that 
>> disables optimization for OpenMPI and I found the __ICC and __INTEL_COMPILER 
>> predefined macros (compiler version no.) are not properly defined:
>> 
>> $ icc -E -dD hello.c | grep __INTEL_COMPILER
>> #define __INTEL_COMPILER 
>> #define __INTEL_COMPILER_BUILD_DATE 20110811
>> 
>> $ icc -E -dD hello.c | grep __ICC   
>> #define __ICC 
>> 
>> $ icc -v
>> icc version 12.1.0 (gcc version 4.1.2 compatibility)
>> 
>> I do not know if there is code in OpenMPI that looks at __ICC and 
>> __INTEL_COMPILER, but that could cause problems.  (Pass this on upstream to 
>> the libtool people?)
>> 
>> Larry Baker
>> US Geological Survey
>> 650-329-5608
>> ba...@usgs.gov
>> 
>> On 17 Oct 2011, at 8:18 PM, George Bosilca wrote:
>> 
>>> Larry,
>>> 
>>> Sorry for not updating this thread. The issue was identified and fixed by 
>>> Rainer in r25290 (https://svn.open-mpi.org/trac/ompi/changeset/25290). 
>>> Please read the comments and the linked thread on the Intel forum for more 
>>> info about.
>>> 
>>> I couldn't find a trace of this being fixed in the 1.4 series, so I would 
>>> wait upgrading until this issue gets resolved.
>>> 
>>>   Thanks,
>>> george.
>>> 
>>> On Oct 17, 2011, at 23:00 , Larry Baker wrote:
>>> 
>>>> George,
>>>> 
>>>> I have not had time to look over the 1.4.3 make check failure for Intel 
>>>> 2011.6.233 compilers.  Have you?
>>>> 
>>>> I had planned to get 1.4.3 compiled on all six of our compilers using the 
>>>> latest compiler releases.  I was putting off upgrading to 1.4.4 or 1.5.x 
>>>> until after that to minimize the number of things that could go wrong.  Do 
>>>> you recommend otherwise?
>>>> 
>>>> Larry Baker
>>>> US Geological Survey
>>>> 650-329-5608
>>>> ba...@usgs.gov
>>>> 
>>>> On 7 Oct 2011, at 6:46 PM, George Bosilca wrote:
 
>>>>> The may_alias attribute was part of forward-looking attribute checking, 
>>>>> at a time when few compilers supported such attributes. This explains why 
>>>>> they are not widely used in the library itself. Moreover, as they do not 
>>>>> affect the compilation itself (as your test highlights, this is not the 
>>>>> issue with the icc 2011.6.233 compiler), there is no urgency to remove 
>>>>> the may_alias support.
>>>>> 
>>>>> I just got that particular version of the compiler installed on one of 
>>>>> our machines. I'll give it a try over the weekend.
>>>>> 
>>>>>   george.
>>>>> 
>>>>> On Oct 7, 2011, at 20:21 , Larry Baker wrote:
>>>>> 
>>>>>> The test for the __may_alias__ attribute uses the following short code 
>>>>>> snippet:
>>>>>> 
>>>>>>> int * p_value __attribute__ ((__may_alias__));
>>>>>>> int
>>>>>>> main ()
>>>>>>> {
>>>>>>> 
>>>>>>>   ;
>>>>>>>   return 0;
>>>>>>> }
>>>>>> 
>>>>>> Indeed, for Intel 2011 compilers prior to 2011.6.233, this results in a 
>>>>>> warning:
>>>>>> 
>>>>>>> [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.5.220
>>>>>>> [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c 
>>>>>>> may_alias_test.c(123): warning #1292: attribute "__may_alias__" ignored
>>>>>>>   int * p_value __attribute__ ((__may_alias__));
>>>>>>> ^
>>>>>>> 
>>>>>>> [root@hydra openmpi-1.4.3]# module unload compilers/intel/2011.5.220
>>>>>> 
>>>>>>> [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.6.233
>>>>>>> [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c 
>>>>>> 
>>>>>> I modified ./configure to force
>>>>>> 
>>>>>>> ompi_cv___attribute__may_alias=0
>>>>>> 
>>>>>> Then I compiled and tested the library.  Unfortunately, the results were 
>>>>>> exactly the same:
>>>>>> 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Ralph Castain
I've been wrestling with something from this commit, and I'm unsure of the 
right answer. So please consider this a general design question for the 
community.

This commit removes all the OMPI <-> ORTE equivalent constants - i.e., we used 
to declare OMPI-prefixed equivalents to every ORTE-prefixed constant. I 
understand the thinking (or at least, what I suspect was the thought), but it 
creates an issue.

Suppose I have an ompi-level function (A) that calls another ompi-level 
function (B). Invisible to A is that B calls an orte-level function. B 
dutifully checks the error return from the orte-level function against an 
ORTE-prefixed constant.

However, if that return isn't "success", what does B return up to A? It cannot 
return the OMPI equivalent to the orte error constant because it no longer 
exists. It could return the orte error code, but A has no way of knowing it is 
going to get a non-OMPI constant, and therefore won't be able to understand it 
- it will be an "unrecognized error".

I guess one option is to require that B "translate" the return code and pass 
some OMPI error up the chain, but this prevents anything upwards from 
understanding the nature of the problem and potentially taking corrective 
and/or alternative action. Seems awfully limiting, as most of the time the only 
option will be the vanilla "OMPI_ERROR".

Thoughts?


On Oct 18, 2011, at 9:51 PM, bosi...@osl.iu.edu wrote:

> Author: bosilca
> Date: 2011-10-18 23:51:53 EDT (Tue, 18 Oct 2011)
> New Revision: 25323
> URL: https://svn.open-mpi.org/trac/ompi/changeset/25323
> 
> Log:
> Cleanup the error codes. Get rid of all the useless ones, and
> mark the distinction between ORTE and OMPI errors.
> 
> Text files modified: 
>   trunk/ompi/errhandler/errcode-internal.c |32 ---
>  
>   trunk/ompi/include/ompi/constants.h  |80 
> +---
>   trunk/ompi/mca/common/sm/common_sm_rml.c | 6 +- 
>  
>   trunk/ompi/mca/pml/dr/pml_dr_sendreq.c   | 5 -- 
>  
>   trunk/ompi/mpiext/cr/c/quiesce_start.c   | 5 ++ 
>  
>   5 files changed, 43 insertions(+), 85 deletions(-)
> 
> Modified: trunk/ompi/errhandler/errcode-internal.c
> ==
> --- trunk/ompi/errhandler/errcode-internal.c  (original)
> +++ trunk/ompi/errhandler/errcode-internal.c  2011-10-18 23:51:53 EDT (Tue, 
> 18 Oct 2011)
> @@ -3,7 +3,7 @@
>  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>  * University Research and Technology
>  * Corporation.  All rights reserved.
> - * Copyright (c) 2004-2007 The University of Tennessee and The University
> + * Copyright (c) 2004-2011 The University of Tennessee and The University
>  * of Tennessee Research Foundation.  All rights
>  * reserved.
>  * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
> @@ -35,9 +35,6 @@
> static ompi_errcode_intern_t ompi_err_temp_out_of_resource;
> static ompi_errcode_intern_t ompi_err_resource_busy;
> static ompi_errcode_intern_t ompi_err_bad_param;
> -static ompi_errcode_intern_t ompi_err_recv_less_than_posted;
> -static ompi_errcode_intern_t ompi_err_recv_more_than_posted;
> -static ompi_errcode_intern_t ompi_err_no_match_yet;
> static ompi_errcode_intern_t ompi_err_fatal;
> static ompi_errcode_intern_t ompi_err_not_implemented;
> static ompi_errcode_intern_t ompi_err_not_supported;
> @@ -115,30 +112,6 @@
> opal_pointer_array_set_item(&ompi_errcodes_intern, 
> ompi_err_bad_param.index, 
> &ompi_err_bad_param);
> 
> -OBJ_CONSTRUCT(&ompi_err_recv_less_than_posted, ompi_errcode_intern_t);
> -ompi_err_recv_less_than_posted.code = OMPI_ERR_RECV_LESS_THAN_POSTED;
> -ompi_err_recv_less_than_posted.mpi_code = MPI_SUCCESS;
> -ompi_err_recv_less_than_posted.index = pos++;
> -strncpy(ompi_err_recv_less_than_posted.errstring, 
> "OMPI_ERR_RECV_LESS_THAN_POSTED", OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(&ompi_errcodes_intern, 
> ompi_err_recv_less_than_posted.index, 
> -&ompi_err_recv_less_than_posted);
> -
> -OBJ_CONSTRUCT(&ompi_err_recv_more_than_posted, ompi_errcode_intern_t);
> -ompi_err_recv_more_than_posted.code = OMPI_ERR_RECV_MORE_THAN_POSTED;
> -ompi_err_recv_more_than_posted.mpi_code = MPI_ERR_TRUNCATE;
> -ompi_err_recv_more_than_posted.index = pos++;
> -strncpy(ompi_err_recv_more_than_posted.errstring, 
> "OMPI_ERR_RECV_MORE_THAN_POSTED", OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(&ompi_errcodes_intern, 
> ompi_err_recv_more_than_posted.index, 
> -&ompi_err_recv_more_than_posted);
> -
> -OBJ_CONSTRUCT(&ompi_err_no_match_yet, ompi_errcode_intern_t);
> -

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25323

2011-10-19 Thread George Bosilca
Indeed, I removed some of the OMPI level error codes. As you can see in the 
patch they were defined but never used.

I don't think they were worth an RFC, as they are never used in the trunk, nor 
on 1.5 or 1.4. And I did check, because I was wondering why they existed in 
the first place.

If [by some miracle] they are used by people working on non-trunk branches, I 
do apologize for the inconvenience to them.

  george.

On Oct 19, 2011, at 10:37 , Jeff Squyres wrote:

> George --
> 
> Did you actually remove some of the error codes?
> 
> I think that should have been worthy of a (quick) RFC first, just to let 
> people know who are working in non-trunk branches who might have been using 
> them.
> 
> 
> On Oct 18, 2011, at 11:51 PM, bosi...@osl.iu.edu wrote:
> 
>> Author: bosilca
>> Date: 2011-10-18 23:51:53 EDT (Tue, 18 Oct 2011)
>> New Revision: 25323
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25323
>> 
>> Log:
>> Cleanup the error codes. Get rid of all the useless ones, and
>> mark the distinction between ORTE and OMPI errors.
>> 
>> Text files modified: 
>>  trunk/ompi/errhandler/errcode-internal.c |32 ---
>>  
>>  trunk/ompi/include/ompi/constants.h  |80 
>> +---
>>  trunk/ompi/mca/common/sm/common_sm_rml.c | 6 +- 
>>  
>>  trunk/ompi/mca/pml/dr/pml_dr_sendreq.c   | 5 -- 
>>  
>>  trunk/ompi/mpiext/cr/c/quiesce_start.c   | 5 ++ 
>>  
>>  5 files changed, 43 insertions(+), 85 deletions(-)
>> 
>> Modified: trunk/ompi/errhandler/errcode-internal.c
>> ==
>> --- trunk/ompi/errhandler/errcode-internal.c (original)
>> +++ trunk/ompi/errhandler/errcode-internal.c 2011-10-18 23:51:53 EDT (Tue, 
>> 18 Oct 2011)
>> @@ -3,7 +3,7 @@
>> * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>> * University Research and Technology
>> * Corporation.  All rights reserved.
>> - * Copyright (c) 2004-2007 The University of Tennessee and The University
>> + * Copyright (c) 2004-2011 The University of Tennessee and The University
>> * of Tennessee Research Foundation.  All rights
>> * reserved.
>> * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
>> @@ -35,9 +35,6 @@
>> static ompi_errcode_intern_t ompi_err_temp_out_of_resource;
>> static ompi_errcode_intern_t ompi_err_resource_busy;
>> static ompi_errcode_intern_t ompi_err_bad_param;
>> -static ompi_errcode_intern_t ompi_err_recv_less_than_posted;
>> -static ompi_errcode_intern_t ompi_err_recv_more_than_posted;
>> -static ompi_errcode_intern_t ompi_err_no_match_yet;
>> static ompi_errcode_intern_t ompi_err_fatal;
>> static ompi_errcode_intern_t ompi_err_not_implemented;
>> static ompi_errcode_intern_t ompi_err_not_supported;
>> @@ -115,30 +112,6 @@
>>opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_bad_param.index, 
>>_err_bad_param);
>> 
>> -OBJ_CONSTRUCT(_err_recv_less_than_posted, ompi_errcode_intern_t);
>> -ompi_err_recv_less_than_posted.code = OMPI_ERR_RECV_LESS_THAN_POSTED;
>> -ompi_err_recv_less_than_posted.mpi_code = MPI_SUCCESS;
>> -ompi_err_recv_less_than_posted.index = pos++;
>> -strncpy(ompi_err_recv_less_than_posted.errstring, 
>> "OMPI_ERR_RECV_LESS_THAN_POSTED", OMPI_MAX_ERROR_STRING);
>> -opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_recv_less_than_posted.index, 
>> -_err_recv_less_than_posted);
>> -
>> -OBJ_CONSTRUCT(_err_recv_more_than_posted, ompi_errcode_intern_t);
>> -ompi_err_recv_more_than_posted.code = OMPI_ERR_RECV_MORE_THAN_POSTED;
>> -ompi_err_recv_more_than_posted.mpi_code = MPI_ERR_TRUNCATE;
>> -ompi_err_recv_more_than_posted.index = pos++;
>> -strncpy(ompi_err_recv_more_than_posted.errstring, 
>> "OMPI_ERR_RECV_MORE_THAN_POSTED", OMPI_MAX_ERROR_STRING);
>> -opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_recv_more_than_posted.index, 
>> -_err_recv_more_than_posted);
>> -
>> -OBJ_CONSTRUCT(_err_no_match_yet, ompi_errcode_intern_t);
>> -ompi_err_no_match_yet.code = OMPI_ERR_NO_MATCH_YET;
>> -ompi_err_no_match_yet.mpi_code = MPI_ERR_PENDING;
>> -ompi_err_no_match_yet.index = pos++;
>> -strncpy(ompi_err_no_match_yet.errstring, "OMPI_ERR_NO_MATCH_YET", 
>> OMPI_MAX_ERROR_STRING);
>> -opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_no_match_yet.index, 
>> -_err_no_match_yet);
>> -
>>OBJ_CONSTRUCT(_err_fatal, ompi_errcode_intern_t);
>>ompi_err_fatal.code = OMPI_ERR_FATAL;
>>ompi_err_fatal.mpi_code = MPI_ERR_INTERN;
>> @@ -232,9 +205,6 @@
>>

Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25323

2011-10-19 Thread Jeff Squyres
George --

Did you actually remove some of the error codes?

I think that should have been worthy of a (quick) RFC first, just to alert 
people working in non-trunk branches who might have been using them.
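
To make the concern concrete, here is a minimal sketch of a table-driven 
translation from internal error codes to MPI-level codes, loosely modeled on 
the registration pattern visible in the quoted diff. Every name and numeric 
value below (the DEMO_ prefix, the table, the lookup helpers) is hypothetical 
and is not Open MPI's API. It only shows what happens to a caller that still 
returns a code after its table entry has been removed: the lookup falls 
through to a generic "unknown" result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical internal codes; values are invented for this sketch. */
enum { DEMO_ERR_BAD_PARAM = -5, DEMO_ERR_FATAL = -6, DEMO_ERR_RETIRED = -20 };

/* Hypothetical MPI-level codes; values are invented for this sketch. */
enum { DEMO_MPI_ERR_ARG = 13, DEMO_MPI_ERR_UNKNOWN = 15, DEMO_MPI_ERR_INTERN = 17 };

typedef struct {
    int code;               /* internal error code */
    int mpi_code;           /* MPI-level code reported to the user */
    const char *errstring;  /* printable name of the internal code */
} demo_errcode_t;

/* DEMO_ERR_RETIRED has deliberately no entry: it plays the role of a
 * code whose registration was deleted from the table. */
static const demo_errcode_t demo_errcodes[] = {
    { DEMO_ERR_BAD_PARAM, DEMO_MPI_ERR_ARG,    "DEMO_ERR_BAD_PARAM" },
    { DEMO_ERR_FATAL,     DEMO_MPI_ERR_INTERN, "DEMO_ERR_FATAL" },
};

/* Linear search; returns NULL when the code is not registered. */
static const demo_errcode_t *demo_errcode_find(int code)
{
    size_t n = sizeof(demo_errcodes) / sizeof(demo_errcodes[0]);
    for (size_t i = 0; i < n; i++) {
        if (demo_errcodes[i].code == code) {
            return &demo_errcodes[i];
        }
    }
    return NULL;
}

/* Translate an internal code; unregistered codes map to "unknown". */
int demo_errcode_to_mpi(int code)
{
    const demo_errcode_t *e = demo_errcode_find(code);
    return (NULL != e) ? e->mpi_code : DEMO_MPI_ERR_UNKNOWN;
}

/* Printable name of an internal code, or "UNKNOWN" if unregistered. */
const char *demo_errcode_name(int code)
{
    const demo_errcode_t *e = demo_errcode_find(code);
    return (NULL != e) ? e->errstring : "UNKNOWN";
}
```

In this sketch a branch that still returns DEMO_ERR_RETIRED keeps working only 
as long as the constant itself exists, and even then it is reported as 
unknown. If the constant is removed from the header as well, such a branch 
simply stops compiling.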


On Oct 18, 2011, at 11:51 PM, bosi...@osl.iu.edu wrote:

> Author: bosilca
> Date: 2011-10-18 23:51:53 EDT (Tue, 18 Oct 2011)
> New Revision: 25323
> URL: https://svn.open-mpi.org/trac/ompi/changeset/25323
> 
> Log:
> Cleanup the error codes. Get rid of all the useless ones, and
> mark the distinction between ORTE and OMPI errors.
> 
> Text files modified: 
>   trunk/ompi/errhandler/errcode-internal.c |32 ---
>  
>   trunk/ompi/include/ompi/constants.h  |80 
> +---
>   trunk/ompi/mca/common/sm/common_sm_rml.c | 6 +- 
>  
>   trunk/ompi/mca/pml/dr/pml_dr_sendreq.c   | 5 -- 
>  
>   trunk/ompi/mpiext/cr/c/quiesce_start.c   | 5 ++ 
>  
>   5 files changed, 43 insertions(+), 85 deletions(-)
> 
> Modified: trunk/ompi/errhandler/errcode-internal.c
> ==
> --- trunk/ompi/errhandler/errcode-internal.c  (original)
> +++ trunk/ompi/errhandler/errcode-internal.c  2011-10-18 23:51:53 EDT (Tue, 
> 18 Oct 2011)
> @@ -3,7 +3,7 @@
>  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>  * University Research and Technology
>  * Corporation.  All rights reserved.
> - * Copyright (c) 2004-2007 The University of Tennessee and The University
> + * Copyright (c) 2004-2011 The University of Tennessee and The University
>  * of Tennessee Research Foundation.  All rights
>  * reserved.
>  * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
> @@ -35,9 +35,6 @@
> static ompi_errcode_intern_t ompi_err_temp_out_of_resource;
> static ompi_errcode_intern_t ompi_err_resource_busy;
> static ompi_errcode_intern_t ompi_err_bad_param;
> -static ompi_errcode_intern_t ompi_err_recv_less_than_posted;
> -static ompi_errcode_intern_t ompi_err_recv_more_than_posted;
> -static ompi_errcode_intern_t ompi_err_no_match_yet;
> static ompi_errcode_intern_t ompi_err_fatal;
> static ompi_errcode_intern_t ompi_err_not_implemented;
> static ompi_errcode_intern_t ompi_err_not_supported;
> @@ -115,30 +112,6 @@
> opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_bad_param.index, 
> _err_bad_param);
> 
> -OBJ_CONSTRUCT(_err_recv_less_than_posted, ompi_errcode_intern_t);
> -ompi_err_recv_less_than_posted.code = OMPI_ERR_RECV_LESS_THAN_POSTED;
> -ompi_err_recv_less_than_posted.mpi_code = MPI_SUCCESS;
> -ompi_err_recv_less_than_posted.index = pos++;
> -strncpy(ompi_err_recv_less_than_posted.errstring, 
> "OMPI_ERR_RECV_LESS_THAN_POSTED", OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_recv_less_than_posted.index, 
> -_err_recv_less_than_posted);
> -
> -OBJ_CONSTRUCT(_err_recv_more_than_posted, ompi_errcode_intern_t);
> -ompi_err_recv_more_than_posted.code = OMPI_ERR_RECV_MORE_THAN_POSTED;
> -ompi_err_recv_more_than_posted.mpi_code = MPI_ERR_TRUNCATE;
> -ompi_err_recv_more_than_posted.index = pos++;
> -strncpy(ompi_err_recv_more_than_posted.errstring, 
> "OMPI_ERR_RECV_MORE_THAN_POSTED", OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_recv_more_than_posted.index, 
> -_err_recv_more_than_posted);
> -
> -OBJ_CONSTRUCT(_err_no_match_yet, ompi_errcode_intern_t);
> -ompi_err_no_match_yet.code = OMPI_ERR_NO_MATCH_YET;
> -ompi_err_no_match_yet.mpi_code = MPI_ERR_PENDING;
> -ompi_err_no_match_yet.index = pos++;
> -strncpy(ompi_err_no_match_yet.errstring, "OMPI_ERR_NO_MATCH_YET", 
> OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_no_match_yet.index, 
> -_err_no_match_yet);
> -
> OBJ_CONSTRUCT(_err_fatal, ompi_errcode_intern_t);
> ompi_err_fatal.code = OMPI_ERR_FATAL;
> ompi_err_fatal.mpi_code = MPI_ERR_INTERN;
> @@ -232,9 +205,6 @@
> OBJ_DESTRUCT(_err_temp_out_of_resource);
> OBJ_DESTRUCT(_err_resource_busy);
> OBJ_DESTRUCT(_err_bad_param);
> -OBJ_DESTRUCT(_err_recv_less_than_posted);
> -OBJ_DESTRUCT(_err_recv_more_than_posted);
> -OBJ_DESTRUCT(_err_no_match_yet);
> OBJ_DESTRUCT(_err_fatal);
> OBJ_DESTRUCT(_err_not_implemented);
> OBJ_DESTRUCT(_err_not_supported);
> 
> Modified: trunk/ompi/include/ompi/constants.h
> ==
> --- trunk/ompi/include/ompi/constants.h   (original)
> +++ 

[OMPI devel] Removing error message

2011-10-19 Thread Jeff Squyres
George --

Can you put this back?

I don't think the error message is meaningless.  It's there because people 
typically copy-n-paste the error message to the user's list (or whatever their 
support channel is).  That error message will mean something to an OMPI 
developer; (I'm guessing/assuming) that's why it was there.
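
As a rough illustration of the point, the sketch below formats the same 
message with the symbolic error name appended. The codes and the 
demo_error_name() helper are hypothetical stand-ins (the original code uses 
ORTE_ERROR_NAME, which is not reproduced here). The message text mirrors the 
line removed in the quoted diff.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical error codes, invented for this sketch. */
enum { DEMO_SUCCESS = 0, DEMO_ERR_TIMEOUT = -1, DEMO_ERR_UNREACH = -2 };

/* Hypothetical name lookup standing in for a real error-name macro. */
static const char *demo_error_name(int rc)
{
    switch (rc) {
    case DEMO_SUCCESS:     return "DEMO_SUCCESS";
    case DEMO_ERR_TIMEOUT: return "DEMO_ERR_TIMEOUT";
    case DEMO_ERR_UNREACH: return "DEMO_ERR_UNREACH";
    default:               return "unknown error";
    }
}

/* Write the diagnostic into buf; returns what snprintf returns. */
int demo_format_attach_failure(char *buf, size_t len, long rank, int rc)
{
    return snprintf(buf, len,
                    "Debugger_attach[rank=%ld]: could not wait for "
                    "debugger - error %s!",
                    rank, demo_error_name(rc));
}
```

A user who pastes the resulting line into a support request hands the 
developer the failing rank and the error name in one go, which the shortened 
message cannot do.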


On Oct 19, 2011, at 9:04 AM, bosi...@osl.iu.edu wrote:

> Author: bosilca
> Date: 2011-10-19 09:04:46 EDT (Wed, 19 Oct 2011)
> New Revision: 25324
> URL: https://svn.open-mpi.org/trac/ompi/changeset/25324
> 
> Log:
> The error here is meaningless.
> 
> Text files modified: 
>   trunk/ompi/debuggers/ompi_debuggers.c | 4 ++--  
>   
>   1 files changed, 2 insertions(+), 2 deletions(-)
> 
> Modified: trunk/ompi/debuggers/ompi_debuggers.c
> ==
> --- trunk/ompi/debuggers/ompi_debuggers.c (original)
> +++ trunk/ompi/debuggers/ompi_debuggers.c 2011-10-19 09:04:46 EDT (Wed, 
> 19 Oct 2011)
> @@ -260,8 +260,8 @@
> /* if it failed for some reason, then we are in trouble -
>  * for now, just report the problem and give up waiting
>  */
> -opal_output(0, "Debugger_attach[rank=%ld]: could not wait for 
> debugger - error %s!",
> -(long)ORTE_PROC_MY_NAME->vpid, ORTE_ERROR_NAME(rc));
> +opal_output(0, "Debugger_attach[rank=%ld]: could not wait for 
> debugger!",
> +(long)ORTE_PROC_MY_NAME->vpid);
> }
> }
> #endif


-- 
Jeff Squyres
jsquy...@cisco.com

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Ralph Castain
It's not just my components, George - there are people with branches out there 
that have OMPI components and changes in them. If you are going to gripe when 
others make changes without warning, then you should abide by your own rules.

:-)


On Oct 19, 2011, at 8:16 AM, George Bosilca wrote:

> OK, just saw your commit. It makes sense: an OMPI component should return 
> OMPI error codes. Thanks for the fix.
> 
>  george.
> 
> On Oct 19, 2011, at 10:12 , George Bosilca wrote:
> 
>> I ran an entire battery of tests on these without any issues. Moreover, it 
>> is an OMPI-related thing, and these error messages were never used. Anyway, 
>> please let me know what exactly failed and I'll fix it ASAP.
>> 
>> Thanks,
>>   george.
>> 
>> On Oct 19, 2011, at 10:06 , Ralph Castain wrote:
>> 
>>> If you are going to make such sweeping changes, could you please provide a 
>>> little warning as per our usual methods? This broke several things, which 
>>> can be repaired, but it would have been nice to know that we were going to 
>>> make such a change.
>>> 
>>> Thx
>>> 
>>> 
>>> On Oct 18, 2011, at 9:51 PM, bosi...@osl.iu.edu wrote:
>>> 
>>>> Author: bosilca
>>>> Date: 2011-10-18 23:51:53 EDT (Tue, 18 Oct 2011)
>>>> New Revision: 25323
>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25323
>>>> 
>>>> Log:
>>>> Cleanup the error codes. Get rid of all the useless ones, and
>>>> mark the distinction between ORTE and OMPI errors.
>>>> 
>>>> Text files modified: 
>>>> trunk/ompi/errhandler/errcode-internal.c |32 ---   
>>>> 
>>>> trunk/ompi/include/ompi/constants.h  |80 
>>>> +---
>>>> trunk/ompi/mca/common/sm/common_sm_rml.c | 6 +-
>>>> 
>>>> trunk/ompi/mca/pml/dr/pml_dr_sendreq.c   | 5 --
>>>> 
>>>> trunk/ompi/mpiext/cr/c/quiesce_start.c   | 5 ++
>>>> 
>>>> 5 files changed, 43 insertions(+), 85 deletions(-)
>>>> 
>>>> Modified: trunk/ompi/errhandler/errcode-internal.c
>>>> ==
>>>> --- trunk/ompi/errhandler/errcode-internal.c   (original)
>>>> +++ trunk/ompi/errhandler/errcode-internal.c   2011-10-18 23:51:53 EDT 
>>>> (Tue, 18 Oct 2011)
>>>> @@ -3,7 +3,7 @@
>>>> * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>>>> * University Research and Technology
>>>> * Corporation.  All rights reserved.
>>>> - * Copyright (c) 2004-2007 The University of Tennessee and The University
>>>> + * Copyright (c) 2004-2011 The University of Tennessee and The University
>>>> * of Tennessee Research Foundation.  All rights
>>>> * reserved.
>>>> * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
>>>> @@ -35,9 +35,6 @@
>>>> static ompi_errcode_intern_t ompi_err_temp_out_of_resource;
>>>> static ompi_errcode_intern_t ompi_err_resource_busy;
>>>> static ompi_errcode_intern_t ompi_err_bad_param;
>>>> -static ompi_errcode_intern_t ompi_err_recv_less_than_posted;
>>>> -static ompi_errcode_intern_t ompi_err_recv_more_than_posted;
>>>> -static ompi_errcode_intern_t ompi_err_no_match_yet;
>>>> static ompi_errcode_intern_t ompi_err_fatal;
>>>> static ompi_errcode_intern_t ompi_err_not_implemented;
>>>> static ompi_errcode_intern_t ompi_err_not_supported;
>>>> @@ -115,30 +112,6 @@
>>>>  opal_pointer_array_set_item(_errcodes_intern, 
>>>> ompi_err_bad_param.index, 
>>>>  _err_bad_param);
>>>> 
>>>> -OBJ_CONSTRUCT(_err_recv_less_than_posted, ompi_errcode_intern_t);
>>>> -ompi_err_recv_less_than_posted.code = OMPI_ERR_RECV_LESS_THAN_POSTED;
>>>> -ompi_err_recv_less_than_posted.mpi_code = MPI_SUCCESS;
>>>> -ompi_err_recv_less_than_posted.index = pos++;
>>>> -strncpy(ompi_err_recv_less_than_posted.errstring, 
>>>> "OMPI_ERR_RECV_LESS_THAN_POSTED", OMPI_MAX_ERROR_STRING);
>>>> -opal_pointer_array_set_item(_errcodes_intern, 
>>>> ompi_err_recv_less_than_posted.index, 
>>>> -_err_recv_less_than_posted);
>>>> -
>>>> -OBJ_CONSTRUCT(_err_recv_more_than_posted, ompi_errcode_intern_t);
>>>> -ompi_err_recv_more_than_posted.code = OMPI_ERR_RECV_MORE_THAN_POSTED;
>>>> -ompi_err_recv_more_than_posted.mpi_code = MPI_ERR_TRUNCATE;
>>>> -ompi_err_recv_more_than_posted.index = pos++;
>>>> -strncpy(ompi_err_recv_more_than_posted.errstring, 
>>>> "OMPI_ERR_RECV_MORE_THAN_POSTED", OMPI_MAX_ERROR_STRING);
>>>> -opal_pointer_array_set_item(_errcodes_intern, 
>>>> ompi_err_recv_more_than_posted.index, 
>>>> -_err_recv_more_than_posted);
>>>> -
>>>> -OBJ_CONSTRUCT(_err_no_match_yet, ompi_errcode_intern_t);
>>>> -ompi_err_no_match_yet.code = OMPI_ERR_NO_MATCH_YET;
>>>> -ompi_err_no_match_yet.mpi_code = 

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread George Bosilca
I ran an entire battery of tests on these without any issues. Moreover, it is 
an OMPI-related thing, and these error messages were never used. Anyway, please 
let me know what exactly failed and I'll fix it ASAP.

  Thanks,
george.

On Oct 19, 2011, at 10:06 , Ralph Castain wrote:

> If you are going to make such sweeping changes, could you please provide a 
> little warning as per our usual methods? This broke several things, which can 
> be repaired, but it would have been nice to know that we were going to make 
> such a change.
> 
> Thx
> 
> 
> On Oct 18, 2011, at 9:51 PM, bosi...@osl.iu.edu wrote:
> 
>> Author: bosilca
>> Date: 2011-10-18 23:51:53 EDT (Tue, 18 Oct 2011)
>> New Revision: 25323
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25323
>> 
>> Log:
>> Cleanup the error codes. Get rid of all the useless ones, and
>> mark the distinction between ORTE and OMPI errors.
>> 
>> Text files modified: 
>>  trunk/ompi/errhandler/errcode-internal.c |32 ---
>>  
>>  trunk/ompi/include/ompi/constants.h  |80 
>> +---
>>  trunk/ompi/mca/common/sm/common_sm_rml.c | 6 +- 
>>  
>>  trunk/ompi/mca/pml/dr/pml_dr_sendreq.c   | 5 -- 
>>  
>>  trunk/ompi/mpiext/cr/c/quiesce_start.c   | 5 ++ 
>>  
>>  5 files changed, 43 insertions(+), 85 deletions(-)
>> 
>> Modified: trunk/ompi/errhandler/errcode-internal.c
>> ==
>> --- trunk/ompi/errhandler/errcode-internal.c (original)
>> +++ trunk/ompi/errhandler/errcode-internal.c 2011-10-18 23:51:53 EDT (Tue, 
>> 18 Oct 2011)
>> @@ -3,7 +3,7 @@
>> * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>> * University Research and Technology
>> * Corporation.  All rights reserved.
>> - * Copyright (c) 2004-2007 The University of Tennessee and The University
>> + * Copyright (c) 2004-2011 The University of Tennessee and The University
>> * of Tennessee Research Foundation.  All rights
>> * reserved.
>> * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
>> @@ -35,9 +35,6 @@
>> static ompi_errcode_intern_t ompi_err_temp_out_of_resource;
>> static ompi_errcode_intern_t ompi_err_resource_busy;
>> static ompi_errcode_intern_t ompi_err_bad_param;
>> -static ompi_errcode_intern_t ompi_err_recv_less_than_posted;
>> -static ompi_errcode_intern_t ompi_err_recv_more_than_posted;
>> -static ompi_errcode_intern_t ompi_err_no_match_yet;
>> static ompi_errcode_intern_t ompi_err_fatal;
>> static ompi_errcode_intern_t ompi_err_not_implemented;
>> static ompi_errcode_intern_t ompi_err_not_supported;
>> @@ -115,30 +112,6 @@
>>opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_bad_param.index, 
>>_err_bad_param);
>> 
>> -OBJ_CONSTRUCT(_err_recv_less_than_posted, ompi_errcode_intern_t);
>> -ompi_err_recv_less_than_posted.code = OMPI_ERR_RECV_LESS_THAN_POSTED;
>> -ompi_err_recv_less_than_posted.mpi_code = MPI_SUCCESS;
>> -ompi_err_recv_less_than_posted.index = pos++;
>> -strncpy(ompi_err_recv_less_than_posted.errstring, 
>> "OMPI_ERR_RECV_LESS_THAN_POSTED", OMPI_MAX_ERROR_STRING);
>> -opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_recv_less_than_posted.index, 
>> -_err_recv_less_than_posted);
>> -
>> -OBJ_CONSTRUCT(_err_recv_more_than_posted, ompi_errcode_intern_t);
>> -ompi_err_recv_more_than_posted.code = OMPI_ERR_RECV_MORE_THAN_POSTED;
>> -ompi_err_recv_more_than_posted.mpi_code = MPI_ERR_TRUNCATE;
>> -ompi_err_recv_more_than_posted.index = pos++;
>> -strncpy(ompi_err_recv_more_than_posted.errstring, 
>> "OMPI_ERR_RECV_MORE_THAN_POSTED", OMPI_MAX_ERROR_STRING);
>> -opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_recv_more_than_posted.index, 
>> -_err_recv_more_than_posted);
>> -
>> -OBJ_CONSTRUCT(_err_no_match_yet, ompi_errcode_intern_t);
>> -ompi_err_no_match_yet.code = OMPI_ERR_NO_MATCH_YET;
>> -ompi_err_no_match_yet.mpi_code = MPI_ERR_PENDING;
>> -ompi_err_no_match_yet.index = pos++;
>> -strncpy(ompi_err_no_match_yet.errstring, "OMPI_ERR_NO_MATCH_YET", 
>> OMPI_MAX_ERROR_STRING);
>> -opal_pointer_array_set_item(_errcodes_intern, 
>> ompi_err_no_match_yet.index, 
>> -_err_no_match_yet);
>> -
>>OBJ_CONSTRUCT(_err_fatal, ompi_errcode_intern_t);
>>ompi_err_fatal.code = OMPI_ERR_FATAL;
>>ompi_err_fatal.mpi_code = MPI_ERR_INTERN;
>> @@ -232,9 +205,6 @@
>>OBJ_DESTRUCT(_err_temp_out_of_resource);
>>OBJ_DESTRUCT(_err_resource_busy);
>>OBJ_DESTRUCT(_err_bad_param);
>> -OBJ_DESTRUCT(_err_recv_less_than_posted);
>> -

Re: [OMPI devel] [OMPI svn] svn:open-mpi r25323

2011-10-19 Thread Ralph Castain
If you are going to make such sweeping changes, could you please provide a 
little warning as per our usual methods? This broke several things, which can 
be repaired, but it would have been nice to know that we were going to make 
such a change.

Thx


On Oct 18, 2011, at 9:51 PM, bosi...@osl.iu.edu wrote:

> Author: bosilca
> Date: 2011-10-18 23:51:53 EDT (Tue, 18 Oct 2011)
> New Revision: 25323
> URL: https://svn.open-mpi.org/trac/ompi/changeset/25323
> 
> Log:
> Cleanup the error codes. Get rid of all the useless ones, and
> mark the distinction between ORTE and OMPI errors.
> 
> Text files modified: 
>   trunk/ompi/errhandler/errcode-internal.c |32 ---
>  
>   trunk/ompi/include/ompi/constants.h  |80 
> +---
>   trunk/ompi/mca/common/sm/common_sm_rml.c | 6 +- 
>  
>   trunk/ompi/mca/pml/dr/pml_dr_sendreq.c   | 5 -- 
>  
>   trunk/ompi/mpiext/cr/c/quiesce_start.c   | 5 ++ 
>  
>   5 files changed, 43 insertions(+), 85 deletions(-)
> 
> Modified: trunk/ompi/errhandler/errcode-internal.c
> ==
> --- trunk/ompi/errhandler/errcode-internal.c  (original)
> +++ trunk/ompi/errhandler/errcode-internal.c  2011-10-18 23:51:53 EDT (Tue, 
> 18 Oct 2011)
> @@ -3,7 +3,7 @@
>  * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
>  * University Research and Technology
>  * Corporation.  All rights reserved.
> - * Copyright (c) 2004-2007 The University of Tennessee and The University
> + * Copyright (c) 2004-2011 The University of Tennessee and The University
>  * of Tennessee Research Foundation.  All rights
>  * reserved.
>  * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart, 
> @@ -35,9 +35,6 @@
> static ompi_errcode_intern_t ompi_err_temp_out_of_resource;
> static ompi_errcode_intern_t ompi_err_resource_busy;
> static ompi_errcode_intern_t ompi_err_bad_param;
> -static ompi_errcode_intern_t ompi_err_recv_less_than_posted;
> -static ompi_errcode_intern_t ompi_err_recv_more_than_posted;
> -static ompi_errcode_intern_t ompi_err_no_match_yet;
> static ompi_errcode_intern_t ompi_err_fatal;
> static ompi_errcode_intern_t ompi_err_not_implemented;
> static ompi_errcode_intern_t ompi_err_not_supported;
> @@ -115,30 +112,6 @@
> opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_bad_param.index, 
> _err_bad_param);
> 
> -OBJ_CONSTRUCT(_err_recv_less_than_posted, ompi_errcode_intern_t);
> -ompi_err_recv_less_than_posted.code = OMPI_ERR_RECV_LESS_THAN_POSTED;
> -ompi_err_recv_less_than_posted.mpi_code = MPI_SUCCESS;
> -ompi_err_recv_less_than_posted.index = pos++;
> -strncpy(ompi_err_recv_less_than_posted.errstring, 
> "OMPI_ERR_RECV_LESS_THAN_POSTED", OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_recv_less_than_posted.index, 
> -_err_recv_less_than_posted);
> -
> -OBJ_CONSTRUCT(_err_recv_more_than_posted, ompi_errcode_intern_t);
> -ompi_err_recv_more_than_posted.code = OMPI_ERR_RECV_MORE_THAN_POSTED;
> -ompi_err_recv_more_than_posted.mpi_code = MPI_ERR_TRUNCATE;
> -ompi_err_recv_more_than_posted.index = pos++;
> -strncpy(ompi_err_recv_more_than_posted.errstring, 
> "OMPI_ERR_RECV_MORE_THAN_POSTED", OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_recv_more_than_posted.index, 
> -_err_recv_more_than_posted);
> -
> -OBJ_CONSTRUCT(_err_no_match_yet, ompi_errcode_intern_t);
> -ompi_err_no_match_yet.code = OMPI_ERR_NO_MATCH_YET;
> -ompi_err_no_match_yet.mpi_code = MPI_ERR_PENDING;
> -ompi_err_no_match_yet.index = pos++;
> -strncpy(ompi_err_no_match_yet.errstring, "OMPI_ERR_NO_MATCH_YET", 
> OMPI_MAX_ERROR_STRING);
> -opal_pointer_array_set_item(_errcodes_intern, 
> ompi_err_no_match_yet.index, 
> -_err_no_match_yet);
> -
> OBJ_CONSTRUCT(_err_fatal, ompi_errcode_intern_t);
> ompi_err_fatal.code = OMPI_ERR_FATAL;
> ompi_err_fatal.mpi_code = MPI_ERR_INTERN;
> @@ -232,9 +205,6 @@
> OBJ_DESTRUCT(_err_temp_out_of_resource);
> OBJ_DESTRUCT(_err_resource_busy);
> OBJ_DESTRUCT(_err_bad_param);
> -OBJ_DESTRUCT(_err_recv_less_than_posted);
> -OBJ_DESTRUCT(_err_recv_more_than_posted);
> -OBJ_DESTRUCT(_err_no_match_yet);
> OBJ_DESTRUCT(_err_fatal);
> OBJ_DESTRUCT(_err_not_implemented);
> OBJ_DESTRUCT(_err_not_supported);
> 
> Modified: trunk/ompi/include/ompi/constants.h
> ==
> --- trunk/ompi/include/ompi/constants.h   (original)
> 

Re: [OMPI devel] make check fails for Intel 2011.6.233 (OpenMPI 1.4.3)

2011-10-19 Thread George Bosilca
Thanks Larry,

Will forward this info upstream.

  george.
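
The empty __ICC and __INTEL_COMPILER definitions reported in the quoted 
message below matter because a plain "#if __INTEL_COMPILER >= N" does not 
preprocess when the macro expands to nothing. One common defensive idiom, 
sketched here as an assumption rather than as anything Open MPI actually 
does, is to append "+ 0" so the expression stays well formed. The DEMO_ 
macros and functions are hypothetical and only reproduce the two situations. 
The value 1210 is an illustrative threshold.

```c
#include <assert.h>

/* Hypothetical macros: one defined but empty (as reported for icc
 * 12.1.0), one carrying a numeric version. */
#define DEMO_COMPILER_EMPTY
#define DEMO_COMPILER_NUMERIC 1210

/* "(X + 0)" expands to "( + 0)" when X is empty, i.e. unary plus on 0,
 * so the #if still evaluates instead of failing to preprocess. */
#if (DEMO_COMPILER_EMPTY + 0) >= 1210
#define DEMO_EMPTY_IS_RECENT 1
#else
#define DEMO_EMPTY_IS_RECENT 0
#endif

#if (DEMO_COMPILER_NUMERIC + 0) >= 1210
#define DEMO_NUMERIC_IS_RECENT 1
#else
#define DEMO_NUMERIC_IS_RECENT 0
#endif

/* An empty version macro evaluates as 0, so it counts as "old". */
int demo_empty_is_recent(void)   { return DEMO_EMPTY_IS_RECENT; }

/* A numeric version macro compares as usual. */
int demo_numeric_is_recent(void) { return DEMO_NUMERIC_IS_RECENT; }

/* Same idiom on the real macro: -1 if not an Intel compiler, 0 if the
 * macro is defined but empty, otherwise its numeric value. */
int demo_intel_version(void)
{
#if defined(__INTEL_COMPILER)
    return (__INTEL_COMPILER + 0);
#else
    return -1;
#endif
}
```

The trade-off is that an empty definition silently evaluates as version 0, so 
a workaround guarded by "version >= N" would be skipped on exactly the 
compiler release that needs it. A guard based on defined(__INTEL_COMPILER) 
alone avoids that, at the cost of applying the workaround to every release.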

On Oct 18, 2011, at 21:56 , Larry Baker wrote:

> George,
> 
> Thanks for the update.  FYI, here's all the version numbers reported by the 
> compiler releases I have installed:
> 
>> [baker@hydra ~]$ module load compilers/intel/11.1.080
>> [baker@hydra ~]$ icc -v
>> Version 11.1 
>> [baker@hydra ~]$ module unload compilers/intel/11.1.080
> 
>> [baker@hydra ~]$ module load compilers/intel/2011.3.174
>> [baker@hydra ~]$ icc -v
>> Version 12.0.3
>> [baker@hydra ~]$ module unload compilers/intel/2011.3.174
> 
>> [baker@hydra ~]$ module load compilers/intel/2011.4.191
>> [baker@hydra ~]$ icc -v
>> Version 12.0.4
>> [baker@hydra ~]$ module unload compilers/intel/2011.4.191
> 
>> [baker@hydra ~]$ module load compilers/intel/2011.5.220
>> [baker@hydra ~]$ icc -v
>> Version 12.0.5
>> [baker@hydra ~]$ module unload compilers/intel/2011.5.220
> 
>> [baker@hydra ~]$ module load compilers/intel/2011.6.233
>> [baker@hydra ~]$ icc -v
>> icc version 12.1.0 (gcc version 4.1.2 compatibility)
>> [baker@hydra ~]$ module unload compilers/intel/2011.6.233
> 
> 
> Another problem I found with the Intel 12.1.0 compiler: I started to look at 
> adding a test for the Intel compiler version around the #pragma that disables 
> optimization for OpenMPI and I found the __ICC and __INTEL_COMPILER 
> predefined macros (compiler version no.) are not properly defined:
> 
> $ icc -E -dD hello.c | grep __INTEL_COMPILER
> #define __INTEL_COMPILER 
> #define __INTEL_COMPILER_BUILD_DATE 20110811
> 
> $ icc -E -dD hello.c | grep __ICC   
> #define __ICC 
> 
> $ icc -v
> icc version 12.1.0 (gcc version 4.1.2 compatibility)
> 
> I do not know if there is code in OpenMPI that looks at __ICC and 
> __INTEL_COMPILER, but that could cause problems.  (Pass this on upstream to 
> the libtool people?)
> 
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
> 
> On 17 Oct 2011, at 8:18 PM, George Bosilca wrote:
> 
>> Larry,
>> 
>> Sorry for not updating this thread. The issue was identified and fixed by 
>> Rainer in r25290 (https://svn.open-mpi.org/trac/ompi/changeset/25290). 
>> Please read the comments and the linked thread on the Intel forum for more 
>> information.
>> 
>> I couldn't find a trace of this being fixed in the 1.4 series, so I would 
>> hold off on upgrading until this issue gets resolved.
>> 
>>   Thanks,
>> george.
>> 
>> On Oct 17, 2011, at 23:00 , Larry Baker wrote:
>> 
>>> George,
>>> 
>>> I have not had time to look over the 1.4.3 make check failure for Intel 
>>> 2011.6.233 compilers.  Have you?
>>> 
>>> I had planned to get 1.4.3 compiled on all six of our compilers using the 
>>> latest compiler releases.  I was putting off upgrading to 1.4.4 or 1.5.x 
>>> until after that to minimize the number of things that could go wrong.  Do 
>>> you recommend otherwise?
>>> 
>>> Larry Baker
>>> US Geological Survey
>>> 650-329-5608
>>> ba...@usgs.gov
>>> 
>>> On 7 Oct 2011, at 6:46 PM, George Bosilca wrote:
>>> 
>>>> The may_alias attribute was part of a forward-looking attribute check, 
>>>> written at a time when few compilers supported such attributes. This 
>>>> explains why it is not widely used in the library itself. Moreover, as it 
>>>> does not affect the compilation itself (as your test highlights, this is 
>>>> not the issue with the icc 2011.6.233 compiler), there is no urgency to 
>>>> remove the may_alias support.
>>>> 
>>>> I just got that particular version of the compiler installed on one of our 
>>>> machines. I'll give it a try over the weekend.
>>>> 
>>>>   george.
>>>> 
>>>> On Oct 7, 2011, at 20:21 , Larry Baker wrote:
>>>> 
>>>>> The test for the __may_alias__ attribute uses the following short code 
>>>>> snippet:
>>>>> 
>>>>>> int * p_value __attribute__ ((__may_alias__));
>>>>>> int
>>>>>> main ()
>>>>>> {
>>>>>> 
>>>>>>   ;
>>>>>>   return 0;
>>>>>> }
>>>>> 
>>>>> Indeed, for Intel 2011 compilers prior to 2011.6.233, this results in a 
>>>>> warning:
>>>>> 
>>>>>> [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.5.220
>>>>>> [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c 
>>>>>> may_alias_test.c(123): warning #1292: attribute "__may_alias__" ignored
>>>>>>   int * p_value __attribute__ ((__may_alias__));
>>>>>> ^
>>>>>> 
>>>>>> [root@hydra openmpi-1.4.3]# module unload compilers/intel/2011.5.220
>>>>> 
>>>>>> [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.6.233
>>>>>> [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c 
>>>>> 
>>>>> 
>>>>> I modified ./configure to force
>>>>> 
>>>>>> ompi_cv___attribute__may_alias=0
>>>>> 
>>>>> 
>>>>> Then I compiled and tested the library.  Unfortunately, the results were 
>>>>> exactly the same:
>>>>> 
>>>>>> make  check-TESTS
>>>>>> make[3]: Entering directory 
>>>>>> `/state/partition1/root/src/openmpi-1.4.3/test/datatype'
>>>>>> /bin/sh: line 4: 26326 Segmentation fault  ${dir}$tst
>>>>>> FAIL: checksum
>>>>>> /bin/sh: line 4: 26359