Hi Gilles

I'm not sure if we have a problem or not - we'll have to wait and see, I guess. 
So far, I'm not seeing any problems on x86 archs, but that's to be expected and 
I don't have access to anything else.

I fixed the issues you noted plus a few others I found. I imagine we'll 
discover more as we go :-/

Thanks!
Ralph


On Aug 1, 2014, at 4:00 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> 
wrote:

> One last point:
> 
> in orte_process_name_t, jobid and vpid have type orte_jobid_t and orte_vpid_t, 
> which are really both uint32_t.
> 
> In orte/util/proc.c, the function pointers opal_process_name_vpid and 
> opal_process_name_jobid
> return an int32_t.
> 
> Should they return a uint32_t instead?
> /* and then _process_name_jobid_for_opal, _process_name_vpid_for_opal, 
> opal_process_name_vpid_should_never_be_called
> should also be updated */
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/01 19:52, Gilles Gouaillardet wrote:
>> George and Ralph,
>> 
>> I am very confused about whether there is an issue or not.
>> 
>> 
>> anyway, today Paul and i ran basic tests on big endian machines and did
>> not face any issue related to big endianness.
>> 
>> So I did my homework, dug into the code, and found that basically
>> opal_process_name_t is used as an orte_process_name_t.
>> for example, in ompi_proc_init :
>> 
>> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->jobid =
>> OMPI_PROC_MY_NAME->jobid;
>> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->vpid = i;
>> 
>> and with
>> 
>> #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a))
>> 
>> So as long as an opal_process_name_t is used as an orte_process_name_t,
>> there is no problem,
>> regardless of the endianness of the homogeneous cluster we are running on.
>> 
>> For the sake of readability (and to be pedantic too ;-) ), in r32357,
>> &proc_temp->super.proc_name
>> could be replaced with
>> OMPI_CAST_ORTE_NAME(&proc_temp->super.proc_name)
>> 
>> 
>> 
>> That being said, in btl/tcp, I noticed:
>> 
>> in mca_btl_tcp_component_recv_handler :
>> 
>>     opal_process_name_t guid;
>> [...]
>>     /* recv the process identifier */
>>     retval = recv(sd, (char *)&guid, sizeof(guid), 0);
>>     if(retval != sizeof(guid)) {
>>         CLOSE_THE_SOCKET(sd);
>>         return;
>>     }
>>     OPAL_PROCESS_NAME_NTOH(guid);
>> 
>> and in mca_btl_tcp_endpoint_send_connect_ack :
>> 
>>     /* send process identifier to remote endpoint */
>>     opal_process_name_t guid = btl_proc->proc_opal->proc_name;
>> 
>>     OPAL_PROCESS_NAME_HTON(guid);
>>     if(mca_btl_tcp_endpoint_send_blocking(btl_endpoint, &guid,
>> sizeof(guid)) !=
>> 
>> and with these macros currently defined as no-ops:
>> 
>> #define OPAL_PROCESS_NAME_NTOH(guid)
>> #define OPAL_PROCESS_NAME_HTON(guid)
>> 
>> 
>> I have not had time to test yet, but for now I can only suspect:
>> - there will be an issue with the tcp btl on a heterogeneous cluster
>> - in that case, the fix is to have a different version of the
>> OPAL_PROCESS_NAME_xTOy macros
>>   on little-endian archs when heterogeneous mode is supported.
>> 
>> 
>> 
>> Does that make sense?
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> On 2014/07/31 1:29, George Bosilca wrote:
>>> The underlying structure changed, so a little bit of fiddling is normal.
>>> Instead of using a field in the ompi_proc_t you are now using a field down
>>> in opal_proc_t, a field that simply cannot have the same type as before
>>> (orte_process_name_t).
>>> 
>>>   George.
>>> 
>>> 
>>> 
>>> On Wed, Jul 30, 2014 at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> George - my point was that we regularly tested using the method in that
>>>> routine, and now we have to do something a little different. So it is an
>>>> "issue" in that we have to make changes across the code base to ensure we
>>>> do things the "new" way, that's all
>>>> 
>>>> On Jul 30, 2014, at 9:17 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> 
>>>> No, this is not going to be an issue if the opal_identifier_t is used
>>>> correctly (aka only via the exposed accessors).
>>>> 
>>>>   George.
>>>> 
>>>> 
>>>> 
>>>> On Wed, Jul 30, 2014 at 12:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>>> Yeah, my fix won't work for big endian machines - this is going to be an
>>>>> issue across the code base now, so we'll have to troll and fix it. I was
>>>>> doing the minimal change required to fix the trunk in the meantime.
>>>>> 
>>>>> On Jul 30, 2014, at 9:06 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>> 
>>>>> Yes. opal_process_name_t has basically no meaning by itself, it is a 64
>>>>> bits storage location used by the upper layer to save some local key that
>>>>> can be later used to extract information. Calling the OPAL level compare
>>>>> function might be a better fit there.
>>>>> 
>>>>>   George.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Jul 30, 2014 at 11:50 AM, Gilles Gouaillardet <
>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>> 
>>>>>> Ralph,
>>>>>> 
>>>>>> was it really that simple?
>>>>>> 
>>>>>> proc_temp->super.proc_name has type opal_process_name_t :
>>>>>> typedef opal_identifier_t opal_process_name_t;
>>>>>> typedef uint64_t opal_identifier_t;
>>>>>> 
>>>>>> *but*
>>>>>> 
>>>>>> item_ptr->peer has type orte_process_name_t :
>>>>>> struct orte_process_name_t {
>>>>>>    orte_jobid_t jobid;
>>>>>>    orte_vpid_t vpid;
>>>>>> };
>>>>>> 
>>>>>> Bottom line: is r32357 still valid on a big-endian arch?
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> 
>>>>>> On Wed, Jul 30, 2014 at 11:49 PM, Ralph Castain <r...@open-mpi.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> I just fixed this one - all that was required was an ampersand as the
>>>>>>> name was being passed into the function instead of a pointer to the name
>>>>>>> 
>>>>>>> r32357
>>>>>>> 
>>>>>>> On Jul 30, 2014, at 7:43 AM, Gilles GOUAILLARDET <
>>>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Rolf,
>>>>>>> 
>>>>>>> r32353 can be seen as a suspect...
>>>>>>> Even if it is correct, it might have further exposed the bug discussed
>>>>>>> in #4815 (e.g. we now hit the bug 100% of the time after the fix).
>>>>>>> 
>>>>>>> Does the attached patch on #4815 fix the problem?
>>>>>>> 
>>>>>>> If yes, and if you see this issue as a showstopper, feel free to commit
>>>>>>> it and drop a note to #4815
>>>>>>> ( I am afk until tomorrow)
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Gilles
>>>>>>> 
>>>>>>> Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>>>>> 
>>>>>>> Just an FYI that my trunk version (r32355) does not work at all anymore
>>>>>>> if I do not include "--mca coll ^ml".    Here is a stack trace from the
>>>>>>> ibm/pt2pt/send test running on a single node.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> (gdb) where
>>>>>>> 
>>>>>>> #0  0x00007f6c0d1321d0 in ?? ()
>>>>>>> 
>>>>>>> #1  <signal handler called>
>>>>>>> 
>>>>>>> #2  0x00007f6c183abd52 in orte_util_compare_name_fields (fields=15
>>>>>>> '\017', name1=0x192350001, name2=0xbaf76c) at 
>>>>>>> ../../orte/util/name_fns.c:522
>>>>>>> 
>>>>>>> #3  0x00007f6c0bea17be in bcol_basesmuma_smcm_allgather_connection
>>>>>>> (sm_bcol_module=0x7f6bf3b68040, module=0xb3d200, 
>>>>>>> peer_list=0x7f6c0c0a6748,
>>>>>>> back_files=0x7f6bf3ffd6c8,
>>>>>>> 
>>>>>>>     comm=0x6037a0, input=..., base_fname=0x7f6c0bea2606
>>>>>>> "sm_payload_mem_", map_all=false) at
>>>>>>> ../../../../../ompi/mca/bcol/basesmuma/bcol_basesmuma_smcm.c:237
>>>>>>> 
>>>>>>> #4  0x00007f6c0be98307 in bcol_basesmuma_bank_init_opti
>>>>>>> (payload_block=0xbc0f60, data_offset=64, bcol_module=0x7f6bf3b68040,
>>>>>>> reg_data=0xba28c0)
>>>>>>> 
>>>>>>>     at
>>>>>>> ../../../../../ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:302
>>>>>>> 
>>>>>>> #5  0x00007f6c0cced386 in mca_coll_ml_register_bcols
>>>>>>> (ml_module=0xba5c40) at 
>>>>>>> ../../../../../ompi/mca/coll/ml/coll_ml_module.c:510
>>>>>>> 
>>>>>>> #6  0x00007f6c0cced68f in ml_module_memory_initialization
>>>>>>> (ml_module=0xba5c40) at 
>>>>>>> ../../../../../ompi/mca/coll/ml/coll_ml_module.c:558
>>>>>>> 
>>>>>>> #7  0x00007f6c0ccf06b1 in ml_discover_hierarchy (ml_module=0xba5c40) at
>>>>>>> ../../../../../ompi/mca/coll/ml/coll_ml_module.c:1539
>>>>>>> 
>>>>>>> #8  0x00007f6c0ccf4e0b in mca_coll_ml_comm_query (comm=0x6037a0,
>>>>>>> priority=0x7fffe7991b58) at
>>>>>>> ../../../../../ompi/mca/coll/ml/coll_ml_module.c:2963
>>>>>>> 
>>>>>>> #9  0x00007f6c18cc5b09 in query_2_0_0 (component=0x7f6c0cf50940,
>>>>>>> comm=0x6037a0, priority=0x7fffe7991b58, module=0x7fffe7991b90)
>>>>>>> 
>>>>>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:372
>>>>>>> 
>>>>>>> #10 0x00007f6c18cc5ac8 in query (component=0x7f6c0cf50940,
>>>>>>> comm=0x6037a0, priority=0x7fffe7991b58, module=0x7fffe7991b90)
>>>>>>> 
>>>>>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:355
>>>>>>> 
>>>>>>> #11 0x00007f6c18cc59d2 in check_one_component (comm=0x6037a0,
>>>>>>> component=0x7f6c0cf50940, module=0x7fffe7991b90)
>>>>>>> 
>>>>>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:317
>>>>>>> 
>>>>>>> #12 0x00007f6c18cc5818 in check_components (components=0x7f6c18f46ef0,
>>>>>>> comm=0x6037a0) at 
>>>>>>> ../../../../ompi/mca/coll/base/coll_base_comm_select.c:281
>>>>>>> 
>>>>>>> #13 0x00007f6c18cbe3c9 in mca_coll_base_comm_select (comm=0x6037a0) at
>>>>>>> ../../../../ompi/mca/coll/base/coll_base_comm_select.c:117
>>>>>>> 
>>>>>>> #14 0x00007f6c18c52301 in ompi_mpi_init (argc=1, argv=0x7fffe79924c8,
>>>>>>> requested=0, provided=0x7fffe79922e8) at
>>>>>>> ../../ompi/runtime/ompi_mpi_init.c:918
>>>>>>> 
>>>>>>> #15 0x00007f6c18c86e92 in PMPI_Init (argc=0x7fffe799234c,
>>>>>>> argv=0x7fffe7992340) at pinit.c:84
>>>>>>> 
>>>>>>> #16 0x0000000000401056 in main (argc=1, argv=0x7fffe79924c8) at
>>>>>>> send.c:32
>>>>>>> 
>>>>>>> (gdb) up
>>>>>>> 
>>>>>>> #1  <signal handler called>
>>>>>>> 
>>>>>>> (gdb) up
>>>>>>> 
>>>>>>> #2  0x00007f6c183abd52 in orte_util_compare_name_fields (fields=15
>>>>>>> '\017', name1=0x192350001, name2=0xbaf76c) at 
>>>>>>> ../../orte/util/name_fns.c:522
>>>>>>> 
>>>>>>> 522           if (name1->jobid < name2->jobid) {
>>>>>>> 
>>>>>>> (gdb) print name1
>>>>>>> 
>>>>>>> $1 = (const orte_process_name_t *) 0x192350001
>>>>>>> 
>>>>>>> (gdb) print *name1
>>>>>>> 
>>>>>>> Cannot access memory at address 0x192350001
>>>>>>> 
>>>>>>> (gdb) print name2
>>>>>>> 
>>>>>>> $2 = (const orte_process_name_t *) 0xbaf76c
>>>>>>> 
>>>>>>> (gdb) print *name2
>>>>>>> 
>>>>>>> $3 = {jobid = 2452946945, vpid = 1}
>>>>>>> 
>>>>>>> (gdb)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Gilles
>>>>>>>> Gouaillardet
>>>>>>>> Sent: Wednesday, July 30, 2014 2:16 AM
>>>>>>>> To: Open MPI Developers
>>>>>>>> Subject: Re: [OMPI devel] trunk compilation errors in jenkins
>>>>>>>> George,
>>>>>>>> #4815 is indirectly related to the move:
>>>>>>>> in bcol/basesmuma, we used to compare ompi_process_name_t, and now
>>>>>>>> we (try to) compare an ompi_process_name_t and an opal_process_name_t
>>>>>>>> (which causes a glorious SIGSEGV).
>>>>>>>> I proposed a temporary patch which is both broken and inelegant; could
>>>>>>>> you please advise a correct solution?
>>>>>>>> Cheers,
>>>>>>>> Gilles
>>>>>>>> On 2014/07/27 7:37, George Bosilca wrote:
>>>>>>>>> If you have any issue with the move, I'll be happy to help and/or
>>>>>>>>> support you on your last move toward a completely generic BTL. To
>>>>>>>>> facilitate your work I exposed a minimalistic set of OMPI information
>>>>>>>>> at the OPAL level. Take a look at opal/util/proc.h for more info, but
>>>>>>>>> please try not to expose more.
>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15348.php
>>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
> 
