[OMPI devel] circular library dependence prevents static link on Solaris-10/SPARC

2014-08-07 Thread Paul Hargrove
Testing r32448 on trunk for trac issue #4834, I encounter the following
which appears unrelated to #4834:

  CCLD     orte-info
Undefined                       first referenced
 symbol                             in file
ompi_proc_local_proc                /sandbox/hargrove/OMPI/openmpi-trunk-solaris10-sparcT2-ss12u3-v9-static/BLD/opal/.libs/libopen-pal.a(libmca_btl_sm_la-btl_sm_component.o)
ld: fatal: Symbol referencing errors. No output written to orte-info

Note that this is *static* linking.

This appears to indicate a call from OPAL to OMPI, and I am guessing this
is a side-effect of the BTL move.

Since OMPI contains (many) calls to OPAL this is a circular library
dependence.
Unfortunately, some linkers process their argument strictly left-to-right.
Thus if this dependence is not eliminated one may need "-lmpi -lopen-pal
-lmpi" (or similar) to resolve it.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] v1.8.2 still held up...

2014-08-07 Thread Paul Hargrove
On Thu, Aug 7, 2014 at 8:03 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> > * static linking failure - Gilles has posted a proposed fix, but
> somebody needs to approve and CMR it. Please see:
> > https://svn.open-mpi.org/trac/ompi/ticket/4834
>
> Jeff made a better fix (r32447) to which i added a minor correction
> (r32448).
> as far as i am concerned, i am fine with approving #4841
>
> that being said, per Jeff's commit log :
> "This needs to soak for a day or two on the trunk before moving to the
> v1.8 branch"
>
> so you might want to wait a bit ...
>


I trust Jeff's judgment on the waiting (or not), but can report that except
for an unrelated issue on Solaris-10/SPARC (email coming soon) the changes
in r32447+r32448 resolve the issue on all the OSes I test.

-Paul




Re: [OMPI devel] v1.8.2 still held up...

2014-08-07 Thread Paul Hargrove
On Thu, Aug 7, 2014 at 8:03 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> > * Siegmar reports another alignment issue on Sparc
> > http://www.open-mpi.org/community/lists/users/2014/07/24869.php
> >
> Is there any chance r32449 fixes the issue ?
>
> i found the problem on Solaris/x86_64 but i have no way to test it on a
> Solaris/sparc box.
>

I have Solaris-10/SPARC, just as Siegmar reports using.
However, I don't have gcc-4.9.0 and doubt I can build it myself.

I will see if I can reproduce the problem with 1.8.2rc2 or rc3.
If so, then I'll give r32449 a try.

-Paul




Re: [OMPI devel] v1.8.2 still held up...

2014-08-07 Thread Gilles Gouaillardet
Ralph and all,


> * static linking failure - Gilles has posted a proposed fix, but somebody 
> needs to approve and CMR it. Please see:
> https://svn.open-mpi.org/trac/ompi/ticket/4834

Jeff made a better fix (r32447) to which i added a minor correction
(r32448).
as far as i am concerned, i am fine with approving #4841

that being said, per Jeff's commit log :
"This needs to soak for a day or two on the trunk before moving to the
v1.8 branch"

so you might want to wait a bit ...

> * Siegmar reports another alignment issue on Sparc
> http://www.open-mpi.org/community/lists/users/2014/07/24869.php
>
Is there any chance r32449 fixes the issue ?

i found the problem on Solaris/x86_64 but i have no way to test it on a
Solaris/sparc box.

Cheers,

Gilles


Re: [OMPI devel] v1.8.2 still held up...

2014-08-07 Thread Jeff Squyres (jsquyres)
On Aug 7, 2014, at 1:55 PM, Ralph Castain  wrote:

> * static linking failure - Gilles has posted a proposed fix, but somebody 
> needs to approve and CMR it. Please see:
> https://svn.open-mpi.org/trac/ompi/ticket/4834

Sorry for the hold up.  I just replied on 4834; I'm working on a new patch now.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] v1.8.2 still held up...

2014-08-07 Thread Paul Hargrove
Ralph,

I will hopefully be able to test Gilles's patch for 4834 on applicable
systems today or tomorrow.

So, I can soon answer whether the patch fixes all the problems I reported.
However, I cannot speak at all to the desirability of the approach relative
to the build infrastructure.
I think Jeff may be best qualified to make that judgement.

-Paul


On Thu, Aug 7, 2014 at 10:55 AM, Ralph Castain  wrote:

> Hey folks
>
> I *really* need your help to get this release out the door. It remains
> stuck on two things:
>
> * static linking failure - Gilles has posted a proposed fix, but somebody
> needs to approve and CMR it. Please see:
> https://svn.open-mpi.org/trac/ompi/ticket/4834
>
> * fixes to coll/ml that expanded to fixing page alignment in general -
> someone needs to review/approve it:
> https://svn.open-mpi.org/trac/ompi/ticket/4826
>
>
> We also have three outstanding issues that may not make 1.8.2:
>
> * MPI-I/O issues - looks like ROMIO needs some patches, and OMPIO may have
> an issue:
> http://www.open-mpi.org/community/lists/users/2014/08/24934.php
>
> * Siegmar reports another alignment issue on Sparc
> http://www.open-mpi.org/community/lists/users/2014/07/24869.php
>
> * Siegmar reports an issue that looks like it relates to handling of
> boolean MCA params:
>  http://www.open-mpi.org/community/lists/users/2014/08/24944.php
>
>
> Can someone *please* help out with these?
> Ralph
>
>





Re: [OMPI devel] v1.8.2 still held up...

2014-08-07 Thread Pritchard Jr., Howard
Hi Ralph,

I'll review 4826 as proxy for hjelmn.  I'm just checking that
it builds on my system before saying okay.

Howard


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, August 07, 2014 11:55 AM
To: Open MPI Developers
Subject: [OMPI devel] v1.8.2 still held up...

Hey folks

I *really* need your help to get this release out the door. It remains stuck on 
two things:

* static linking failure - Gilles has posted a proposed fix, but somebody needs 
to approve and CMR it. Please see:
https://svn.open-mpi.org/trac/ompi/ticket/4834

* fixes to coll/ml that expanded to fixing page alignment in general - someone 
needs to review/approve it:
https://svn.open-mpi.org/trac/ompi/ticket/4826


We also have three outstanding issues that may not make 1.8.2:

* MPI-I/O issues - looks like ROMIO needs some patches, and OMPIO may have an 
issue:
http://www.open-mpi.org/community/lists/users/2014/08/24934.php

* Siegmar reports another alignment issue on Sparc
http://www.open-mpi.org/community/lists/users/2014/07/24869.php

* Siegmar reports an issue that looks like it relates to handling of boolean 
MCA params:
 http://www.open-mpi.org/community/lists/users/2014/08/24944.php


Can someone *please* help out with these?
Ralph



[OMPI devel] v1.8.2 still held up...

2014-08-07 Thread Ralph Castain
Hey folks

I *really* need your help to get this release out the door. It remains stuck on 
two things:

* static linking failure - Gilles has posted a proposed fix, but somebody needs 
to approve and CMR it. Please see:
https://svn.open-mpi.org/trac/ompi/ticket/4834

* fixes to coll/ml that expanded to fixing page alignment in general - someone 
needs to review/approve it:
https://svn.open-mpi.org/trac/ompi/ticket/4826


We also have three outstanding issues that may not make 1.8.2:

* MPI-I/O issues - looks like ROMIO needs some patches, and OMPIO may have an 
issue:
http://www.open-mpi.org/community/lists/users/2014/08/24934.php

* Siegmar reports another alignment issue on Sparc
http://www.open-mpi.org/community/lists/users/2014/07/24869.php

* Siegmar reports an issue that looks like it relates to handling of boolean 
MCA params:
 http://www.open-mpi.org/community/lists/users/2014/08/24944.php


Can someone *please* help out with these?
Ralph



Re: [OMPI devel] OMPI devel] trunk compilation errors in jenkins

2014-08-07 Thread Gilles Gouaillardet
Ralph and George,

attached are two patches :
- heterogeneous.v1.patch : a cleanup of the previous patch
- heterogeneous.v2.patch : a new patch based on Ralph's suggestion. i made
the minimal changes to move jobid and vpid into the OPAL layer.

Cheers,

Gilles

On 2014/08/07 11:27, Ralph Castain wrote:
> Are we maybe approaching this from the wrong direction? I ask because we had 
> to do some gyrations in the pmix framework to work around the difference in 
> naming schemes between OPAL and the rest of the code base, and now we have 
> more gyrations here.
>
> Given that the MPI and RTE layers both rely on the structured form of the 
> name, what about if we just mimic that down in OPAL? I think we could perhaps 
> do this in a way that still allows someone to overlay it with a 64-bit 
> unstructured identifier if they want, but that would put the extra work on 
> their side. In other words, we make it easy to work with the other parts of 
> our own code base, acknowledging that those wanting to do something else may 
> have to do some extra work.
>
> I ask because every resource manager out there assigns each process a jobid 
> and vpid in some form of integer format. So we have to absorb that 
> information in {jobid, vpid} format regardless of what we may want to do 
> internally. What we now have to do is immediately convert that into the 
> unstructured form for OPAL (where we take it in via PMI), then convert it 
> back to structured form when passing it up to ORTE so it can be handed to 
> OMPI, and then convert it back to unstructured form every time either OMPI or 
> ORTE accesses the OPAL layer.
>
> Seems awfully convoluted and error prone. Simplifying things for ourselves 
> might make more sense.
>
>
> On Aug 6, 2014, at 1:21 PM, George Bosilca  wrote:
>
>> Gilles,
>>
>> This looks right. It is really unfortunate that we have to change the 
>> definition of orte_process_name_t for big endian architectures, but I don't 
>> think there is a way around it.
>>
>> Regarding your patch I have two comments:
>> 1. There is a flagrant lack of comments ... especially on the ORTE side
>> 2. at the OPAL level we are really implementing a htonll, and I really think 
>> we should stick to the POSIX prototype (aka. returning the changed value 
>> instead of doing things in place).
>>
>>   George.
>>
>>
>>
>> On Wed, Aug 6, 2014 at 7:02 AM, Gilles Gouaillardet 
>>  wrote:
>> Ralph and George,
>>
>> attached is a patch that fixes the heterogeneous support without the 
>> abstraction violation.
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/08/06 9:40, Gilles Gouaillardet wrote:
>>> hummm
>>>
>>> i intentionally did not swap the two 32 bits (!)
>>>
>>> from the top level, what we have is :
>>>
>>> typedef struct {
>>>     union {
>>>         uint64_t opal;
>>>         struct {
>>>             uint32_t jobid;
>>>             uint32_t vpid;
>>>         } orte;
>>>     };
>>> } meta_process_name_t;
>>>
>>> OPAL is agnostic about jobid and vpid.
>>> jobid and vpid are set in ORTE/MPI and OPAL is used only
>>> to transport the 64 bits
>>> /* opal_process_name_t and orte_process_name_t are often cast into each
>>> other */
>>> at ORTE/MPI level, jobid and vpid are set individually
>>> /* e.g. we do *not* do something like opal = jobid | (vpid<<32) */
>>> this is why everything works fine on homogeneous clusters regardless
>>> of endianness.
>>>
>>> now in a heterogeneous cluster, things get a bit trickier ...
>>>
>>> i was initially unhappy with my commit and i think i found out why :
>>> this is an abstraction violation !
>>> the two 32 bits are not swapped by OPAL because this is what is expected by
>>> ORTE/OMPI.
>>>
>>> now i'd like to suggest the following lightweight approach :
>>>
>>> at OPAL, use #if protected htonll/ntohll
>>> (e.g. swap the two 32bits)
>>>
>>> do the trick at the ORTE level :
>>>
>>> simply replace
>>>
>>> struct orte_process_name_t {
>>> orte_jobid_t jobid;
>>> orte_vpid_t vpid;
>>> };
>>>
>>> with
>>>
>>> #if OPAL_ENABLE_HETEROGENEOUS_SUPPORT && !defined(WORDS_BIGENDIAN)
>>> struct orte_process_name_t {
>>> orte_vpid_t vpid;
>>> orte_jobid_t jobid;
>>> };
>>> #else
>>> struct orte_process_name_t {
>>> orte_jobid_t jobid;
>>> orte_vpid_t vpid;
>>> };
>>> #endif
>>>
>>>
>>> so we keep OPAL agnostic about how the uint64_t is really used at the upper
>>> level.
>>> another option is to make OPAL aware of jobid and vpid but this is a bit
>>> more heavyweight imho.
>>>
>>> i'll try this today and make sure it works.
>>>
>>> any thoughts ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Wed, Aug 6, 2014 at 8:17 AM, Ralph Castain  wrote:
>>>
>>>> Ah yes, so it is - sorry I missed that last test :-/
>>>>
>>>> On Aug 5, 2014, at 10:50 AM, George Bosilca  wrote:
>>>>
>>>>> The code committed by Gilles is correctly protected for big endian (
>>>>> https://svn.open-mpi.org/trac/ompi/changeset/32425). I was merely
>>>>> pointing out that I think he should also swap the 2 32 bits in his
>>>>> impleme