The code committed by Gilles is correctly protected for big endian
(https://svn.open-mpi.org/trac/ompi/changeset/32425). I was merely pointing
out that I think he should also swap the two 32-bit halves in his
implementation.
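
For reference, here is the kind of conversion I have in mind. This is only an
untested sketch: the WORDS_BIGENDIAN guard and the hton64 name are placeholders
for whatever the tree actually uses, not the committed code.

#include <stdint.h>
#include <arpa/inet.h>

#ifdef WORDS_BIGENDIAN
/* the host already stores the value in network byte order,
 * so the conversion compiles away */
#define hton64(v) (v)
#else
static inline uint64_t hton64(uint64_t v)
{
    /* htonl byte-swaps each 32-bit half; the shift/or swaps the two halves */
    return ((uint64_t)htonl((uint32_t)v) << 32) |
            (uint64_t)htonl((uint32_t)(v >> 32));
}
#endif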

  George.



On Tue, Aug 5, 2014 at 1:32 PM, Ralph Castain <r...@open-mpi.org> wrote:

>
> On Aug 5, 2014, at 10:23 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> On Tue, Aug 5, 2014 at 1:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Hmmm...wouldn't that then require that you know (a) the other side is
>> little endian, and (b) that you are on a big endian? Otherwise, you wind up
>> with the same issue in reverse, yes?
>>
>
> This is similar to the 32-bit ntohl that we are using in other parts of
> the project. Any little endian participant will do the conversion, while
> every big endian participant will use an empty macro instead.
>
>
>> In the ORTE methods, we explicitly set the fields (e.g., jobid =
>> ntohl(remote-jobid)) to get around this problem. I missed that he did it by
>> location instead of named fields - perhaps we should do that instead?
>>
>
> As soon as we impose the ORTE naming scheme at the OPAL level (aka. the
> notion of jobid and vpid) this approach will become possible.
>
>
> Not proposing that at all so long as the other method will work without
> knowing the other side's endianness. Sounds like your approach should work
> fine as long as Gilles adds a #if so big endian defines the macro away.
>
>
>   George.
>
>
>
>>
>>
>> On Aug 5, 2014, at 10:06 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Technically speaking, converting a 64-bit value to its big endian
>> representation requires swapping the two 32-bit halves. So the correct
>> approach would have been:
>> uint64_t htonll(uint64_t v)
>> {
>>     return (((uint64_t)ntohl(v)) << 32) | (uint64_t)ntohl(v >> 32);
>> }
>>
>>   George.
>>
>>
>>
>> On Tue, Aug 5, 2014 at 5:52 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> FWIW: that's exactly how we do it in ORTE
>>>
>>> On Aug 4, 2014, at 10:25 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@iferc.org> wrote:
>>>
>>> George,
>>>
>>> I confirm there was a problem when running on a heterogeneous cluster;
>>> this is now fixed in r32425.
>>>
>>> I am not convinced I chose the most elegant way to achieve the desired
>>> result ...
>>> Could you please double check this commit?
>>>
>>> Thanks,
>>>
>>> Gilles
>>>
>>> On 2014/08/02 0:14, George Bosilca wrote:
>>>
>>> Gilles,
>>>
>>> The design of the BTL move was to let the opal_process_name_t be agnostic 
>>> to what is stored inside, and all accesses should be done through the 
>>> provided accessors. Thus, big endian or little endian doesn’t make a 
>>> difference, as long as everything goes through the accessors.
>>>
>>> I’m skeptical about the support of heterogeneous environments in the 
>>> current code, so I didn’t pay much attention to handling the case in the 
>>> TCP BTL. But in case we do care, it is enough to make the two macros point to 
>>> something meaningful instead of being empty (bswap_64 or something).
>>>
>>>   George.
>>>
>>> On Aug 1, 2014, at 06:52, Gilles Gouaillardet 
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>
>>> George and Ralph,
>>>
>>> I am very confused about whether there is an issue or not.
>>>
>>>
>>> Anyway, today Paul and I ran basic tests on big endian machines and did not 
>>> face any issue related to big endianness.
>>>
>>> So I did my homework, dug into the code, and basically, 
>>> opal_process_name_t is used as an orte_process_name_t.
>>> For example, in ompi_proc_init:
>>>
>>> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->jobid = 
>>> OMPI_PROC_MY_NAME->jobid;
>>> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->vpid = i;
>>>
>>> and with
>>>
>>> #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a))
>>>
>>> So as long as an opal_process_name_t is used as an orte_process_name_t, 
>>> there is no problem,
>>> regardless of the endianness of the homogeneous cluster we are running on.
>>>
>>> For the sake of readability (and for being pedantic too ;-) ), in r32357,
>>> &proc_temp->super.proc_name
>>> could be replaced with
>>> OMPI_CAST_ORTE_NAME(&proc_temp->super.proc_name)
>>>
>>>
>>>
>>> That being said, in btl/tcp, I noticed:
>>>
>>> in mca_btl_tcp_component_recv_handler:
>>>
>>>     opal_process_name_t guid;
>>> [...]
>>>     /* recv the process identifier */
>>>     retval = recv(sd, (char *)&guid, sizeof(guid), 0);
>>>     if(retval != sizeof(guid)) {
>>>         CLOSE_THE_SOCKET(sd);
>>>         return;
>>>     }
>>>     OPAL_PROCESS_NAME_NTOH(guid);
>>>
>>> and in mca_btl_tcp_endpoint_send_connect_ack :
>>>
>>>     /* send process identifier to remote endpoint */
>>>     opal_process_name_t guid = btl_proc->proc_opal->proc_name;
>>>
>>>     OPAL_PROCESS_NAME_HTON(guid);
>>>     if(mca_btl_tcp_endpoint_send_blocking(btl_endpoint, &guid, 
>>> sizeof(guid)) !=
>>>
>>> and with
>>>
>>> #define OPAL_PROCESS_NAME_NTOH(guid)
>>> #define OPAL_PROCESS_NAME_HTON(guid)
>>>
>>>
>>> I have had no time to test yet, but for now I can only suspect:
>>> - there will be an issue with the tcp btl on a heterogeneous cluster
>>> - for this case, the fix is to have a different version of the 
>>> OPAL_PROCESS_NAME_xTOy macros
>>>   on little endian archs if heterogeneous mode is supported (see the sketch below).
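>>>
>>> for instance, something along these lines (a rough, untested sketch; the
>>> WORDS_BIGENDIAN guard and the hton64 / ntoh64 names are only placeholders
>>> for whatever 64 bits swap we end up picking):
>>>
>>> #if defined(WORDS_BIGENDIAN)
>>> #define OPAL_PROCESS_NAME_NTOH(guid)
>>> #define OPAL_PROCESS_NAME_HTON(guid)
>>> #else
>>> #define OPAL_PROCESS_NAME_NTOH(guid) (guid) = ntoh64(guid)
>>> #define OPAL_PROCESS_NAME_HTON(guid) (guid) = hton64(guid)
>>> #endif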
>>>
>>>
>>>
>>> Does that make sense?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 2014/07/31 1:29, George Bosilca wrote:
>>>
>>> The underlying structure changed, so a little bit of fiddling is normal.
>>> Instead of using a field in the ompi_proc_t you are now using a field down
>>> in opal_proc_t, a field that simply cannot have the same type as before
>>> (orte_process_name_t).
>>>
>>>   George.
>>>
>>>
>>>
>>> On Wed, Jul 30, 2014 at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>
>>> George - my point was that we regularly tested using the method in that
>>> routine, and now we have to do something a little different. So it is an
>>> "issue" in that we have to make changes across the code base to ensure we
>>> do things the "new" way, that's all
>>>
>>> On Jul 30, 2014, at 9:17 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> No, this is not going to be an issue if the opal_identifier_t is used
>>> correctly (aka only via the exposed accessors).
>>>
>>>   George.
>>>
>>>
>>>
>>> On Wed, Jul 30, 2014 at 12:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>
>>> Yeah, my fix won't work for big endian machines - this is going to be an
>>> issue across the code base now, so we'll have to troll and fix it. I was
>>> doing the minimal change required to fix the trunk in the meantime.
>>>
>>> On Jul 30, 2014, at 9:06 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> Yes. opal_process_name_t has basically no meaning by itself; it is a 64-bit
>>> storage location used by the upper layer to save some local key that
>>> can be later used to extract information. Calling the OPAL level compare
>>> function might be a better fit there.
>>>
>>>   George.
>>>
>>>
>>>
>>> On Wed, Jul 30, 2014 at 11:50 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>
>>> Ralph,
>>>
>>> Was it really that simple?
>>>
>>> proc_temp->super.proc_name has type opal_process_name_t :
>>> typedef opal_identifier_t opal_process_name_t;
>>> typedef uint64_t opal_identifier_t;
>>>
>>> *but*
>>>
>>> item_ptr->peer has type orte_process_name_t :
>>> struct orte_process_name_t {
>>>    orte_jobid_t jobid;
>>>    orte_vpid_t vpid;
>>> };
>>>
>>> Bottom line: is r32357 still valid on a big endian arch?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Wed, Jul 30, 2014 at 11:49 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>
>>> I just fixed this one - all that was required was an ampersand as the
>>> name was being passed into the function instead of a pointer to the name
>>>
>>> r32357
>>>
>>> On Jul 30, 2014, at 7:43 AM, Gilles GOUAILLARDET 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>> Rolf,
>>>
>>> r32353 can be seen as a suspect...
>>> Even if it is correct, it might have exposed the bug discussed in #4815
>>> even more (e.g., we hit the bug 100% of the time after the fix).
>>>
>>> Does the attached patch to #4815 fix the problem?
>>>
>>> If yes, and if you see this issue as a showstopper, feel free to commit
>>> it and drop a note to #4815
>>> (I am afk until tomorrow).
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>
>>> Just an FYI that my trunk version (r32355) does not work at all anymore
>>> if I do not include "--mca coll ^ml".    Here is a stack trace from the
>>> ibm/pt2pt/send test running on a single node.
>>>
>>>
>>>
>>> (gdb) where
>>> #0  0x00007f6c0d1321d0 in ?? ()
>>> #1  <signal handler called>
>>> #2  0x00007f6c183abd52 in orte_util_compare_name_fields (fields=15 '\017', name1=0x192350001, name2=0xbaf76c) at ../../orte/util/name_fns.c:522
>>> #3  0x00007f6c0bea17be in bcol_basesmuma_smcm_allgather_connection (sm_bcol_module=0x7f6bf3b68040, module=0xb3d200, peer_list=0x7f6c0c0a6748, back_files=0x7f6bf3ffd6c8,
>>>     comm=0x6037a0, input=..., base_fname=0x7f6c0bea2606 "sm_payload_mem_", map_all=false) at ../../../../../ompi/mca/bcol/basesmuma/bcol_basesmuma_smcm.c:237
>>> #4  0x00007f6c0be98307 in bcol_basesmuma_bank_init_opti (payload_block=0xbc0f60, data_offset=64, bcol_module=0x7f6bf3b68040, reg_data=0xba28c0)
>>>     at ../../../../../ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:302
>>> #5  0x00007f6c0cced386 in mca_coll_ml_register_bcols (ml_module=0xba5c40) at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:510
>>> #6  0x00007f6c0cced68f in ml_module_memory_initialization (ml_module=0xba5c40) at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:558
>>> #7  0x00007f6c0ccf06b1 in ml_discover_hierarchy (ml_module=0xba5c40) at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:1539
>>> #8  0x00007f6c0ccf4e0b in mca_coll_ml_comm_query (comm=0x6037a0, priority=0x7fffe7991b58) at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:2963
>>> #9  0x00007f6c18cc5b09 in query_2_0_0 (component=0x7f6c0cf50940, comm=0x6037a0, priority=0x7fffe7991b58, module=0x7fffe7991b90)
>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:372
>>> #10 0x00007f6c18cc5ac8 in query (component=0x7f6c0cf50940, comm=0x6037a0, priority=0x7fffe7991b58, module=0x7fffe7991b90)
>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:355
>>> #11 0x00007f6c18cc59d2 in check_one_component (comm=0x6037a0, component=0x7f6c0cf50940, module=0x7fffe7991b90)
>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:317
>>> #12 0x00007f6c18cc5818 in check_components (components=0x7f6c18f46ef0, comm=0x6037a0) at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:281
>>> #13 0x00007f6c18cbe3c9 in mca_coll_base_comm_select (comm=0x6037a0) at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:117
>>> #14 0x00007f6c18c52301 in ompi_mpi_init (argc=1, argv=0x7fffe79924c8, requested=0, provided=0x7fffe79922e8) at ../../ompi/runtime/ompi_mpi_init.c:918
>>> #15 0x00007f6c18c86e92 in PMPI_Init (argc=0x7fffe799234c, argv=0x7fffe7992340) at pinit.c:84
>>> #16 0x0000000000401056 in main (argc=1, argv=0x7fffe79924c8) at send.c:32
>>> (gdb) up
>>> #1  <signal handler called>
>>> (gdb) up
>>> #2  0x00007f6c183abd52 in orte_util_compare_name_fields (fields=15 '\017', name1=0x192350001, name2=0xbaf76c) at ../../orte/util/name_fns.c:522
>>> 522           if (name1->jobid < name2->jobid) {
>>> (gdb) print name1
>>> $1 = (const orte_process_name_t *) 0x192350001
>>> (gdb) print *name1
>>> Cannot access memory at address 0x192350001
>>> (gdb) print name2
>>> $2 = (const orte_process_name_t *) 0xbaf76c
>>> (gdb) print *name2
>>> $3 = {jobid = 2452946945, vpid = 1}
>>> (gdb)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Gilles Gouaillardet
>>> Sent: Wednesday, July 30, 2014 2:16 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] trunk compilation errors in jenkins
>>> George,
>>>
>>> #4815 is indirectly related to the move:
>>> in bcol/basesmuma, we used to compare ompi_process_name_t, and now
>>> we (try to) compare an ompi_process_name_t and an opal_process_name_t
>>> (which causes a glorious SIGSEGV).
>>> I proposed a temporary patch which is both broken and inelegant; could
>>> you please advise a correct solution?
>>>
>>> Cheers,
>>> Gilles
>>> On 2014/07/27 7:37, George Bosilca wrote:
>>>
>>> If you have any issue with the move, I’ll be happy to help and/or support
>>> you on your last move toward a completely generic BTL. To facilitate your
>>> work I exposed a minimalistic set of OMPI information at the OPAL level. Take
>>> a look at opal/util/proc.h for more info, but please try not to expose more.
>
