Oh, should point out: I didn't deal with the potential btl/tcp issue you noted - I defer that to George
On Aug 1, 2014, at 7:56 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Hi Gilles
>
> I'm not sure if we have a problem or not - we'll have to wait and see, I
> guess. So far, I'm not seeing any problems on x86 archs, but that's to be
> expected and I don't have access to anything else.
>
> I fixed the issues you noted plus a few others I found. I imagine we'll
> discover more as we go :-/
>
> Thanks!
> Ralph
>
>
> On Aug 1, 2014, at 4:00 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> one last point:
>>
>> in orte_process_name_t, jobid and vpid have types orte_jobid_t and
>> orte_vpid_t, which really are uint32_t.
>>
>> in orte/util/proc.c, the function pointers opal_process_name_vpid and
>> opal_process_name_jobid return an int32_t.
>>
>> should it be a uint32_t instead?
>> /* and then _process_name_jobid_for_opal, _process_name_vpid_for_opal,
>> and opal_process_name_vpid_should_never_be_called
>> should also be updated */
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/01 19:52, Gilles Gouaillardet wrote:
>>> George and Ralph,
>>>
>>> i am very confused about whether there is an issue or not.
>>>
>>> anyway, today Paul and i ran basic tests on big endian machines and did
>>> not face any issue related to big endianness.
>>>
>>> so i did my homework, dug into the code, and basically,
>>> opal_process_name_t is used as an orte_process_name_t.
>>> for example, in ompi_proc_init:
>>>
>>> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->jobid =
>>>     OMPI_PROC_MY_NAME->jobid;
>>> OMPI_CAST_ORTE_NAME(&proc->super.proc_name)->vpid = i;
>>>
>>> with
>>>
>>> #define OMPI_CAST_ORTE_NAME(a) ((orte_process_name_t*)(a))
>>>
>>> so as long as an opal_process_name_t is used as an orte_process_name_t,
>>> there is no problem, regardless of the endianness of the homogeneous
>>> cluster we are running on.
>>> for the sake of readability (and of being pedantic too ;-) ), in r32357,
>>> &proc_temp->super.proc_name could be replaced with
>>> OMPI_CAST_ORTE_NAME(&proc_temp->super.proc_name)
>>>
>>>
>>> That being said, in btl/tcp, i noticed:
>>>
>>> in mca_btl_tcp_component_recv_handler:
>>>
>>> opal_process_name_t guid;
>>> [...]
>>> /* recv the process identifier */
>>> retval = recv(sd, (char *)&guid, sizeof(guid), 0);
>>> if (retval != sizeof(guid)) {
>>>     CLOSE_THE_SOCKET(sd);
>>>     return;
>>> }
>>> OPAL_PROCESS_NAME_NTOH(guid);
>>>
>>> and in mca_btl_tcp_endpoint_send_connect_ack:
>>>
>>> /* send process identifier to remote endpoint */
>>> opal_process_name_t guid = btl_proc->proc_opal->proc_name;
>>> OPAL_PROCESS_NAME_HTON(guid);
>>> if (mca_btl_tcp_endpoint_send_blocking(btl_endpoint, &guid,
>>>     sizeof(guid)) !=
>>>
>>> with
>>>
>>> #define OPAL_PROCESS_NAME_NTOH(guid)
>>> #define OPAL_PROCESS_NAME_HTON(guid)
>>>
>>> i have not had time to test yet, but for now, i can only suspect:
>>> - there will be an issue with the tcp btl on a heterogeneous cluster
>>> - in that case, the fix is to have a different version of the
>>>   OPAL_PROCESS_NAME_xTOy macros on little endian archs when
>>>   heterogeneous mode is supported.
>>>
>>> does that make sense?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 2014/07/31 1:29, George Bosilca wrote:
>>>> The underlying structure changed, so a little bit of fiddling is normal.
>>>> Instead of using a field in the ompi_proc_t, you are now using a field
>>>> down in opal_proc_t, a field that simply cannot have the same type as
>>>> before (orte_process_name_t).
>>>>
>>>> George.
>>>>
>>>> On Wed, Jul 30, 2014 at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> George - my point was that we regularly tested using the method in that
>>>>> routine, and now we have to do something a little different. So it is an
>>>>> "issue" in that we have to make changes across the code base to ensure
>>>>> we do things the "new" way, that's all.
>>>>>
>>>>> On Jul 30, 2014, at 9:17 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>> No, this is not going to be an issue if the opal_identifier_t is used
>>>>> correctly (aka only via the exposed accessors).
>>>>>
>>>>> George.
>>>>>
>>>>> On Wed, Jul 30, 2014 at 12:09 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Yeah, my fix won't work for big endian machines - this is going to be
>>>>>> an issue across the code base now, so we'll have to troll and fix it.
>>>>>> I was doing the minimal change required to fix the trunk in the
>>>>>> meantime.
>>>>>>
>>>>>> On Jul 30, 2014, at 9:06 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>
>>>>>> Yes. opal_process_name_t has basically no meaning by itself; it is a
>>>>>> 64-bit storage location used by the upper layer to save some local key
>>>>>> that can later be used to extract information. Calling the OPAL level
>>>>>> compare function might be a better fit there.
>>>>>>
>>>>>> George.
>>>>>>
>>>>>> On Wed, Jul 30, 2014 at 11:50 AM, Gilles Gouaillardet <
>>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>>
>>>>>>> Ralph,
>>>>>>>
>>>>>>> was it really that simple?
>>>>>>>
>>>>>>> proc_temp->super.proc_name has type opal_process_name_t:
>>>>>>> typedef opal_identifier_t opal_process_name_t;
>>>>>>> typedef uint64_t opal_identifier_t;
>>>>>>>
>>>>>>> *but*
>>>>>>>
>>>>>>> item_ptr->peer has type orte_process_name_t:
>>>>>>> struct orte_process_name_t {
>>>>>>>     orte_jobid_t jobid;
>>>>>>>     orte_vpid_t vpid;
>>>>>>> };
>>>>>>>
>>>>>>> bottom line, is r32357 still valid on a big endian arch?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Wed, Jul 30, 2014 at 11:49 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I just fixed this one - all that was required was an ampersand, as
>>>>>>>> the name was being passed into the function instead of a pointer to
>>>>>>>> the name.
>>>>>>>>
>>>>>>>> r32357
>>>>>>>>
>>>>>>>> On Jul 30, 2014, at 7:43 AM, Gilles GOUAILLARDET <
>>>>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Rolf,
>>>>>>>>
>>>>>>>> r32353 can be seen as a suspect...
>>>>>>>> Even if it is correct, it might have exposed the bug discussed in
>>>>>>>> #4815 even more (e.g. we hit the bug 100% of the time after the fix).
>>>>>>>>
>>>>>>>> does the attached patch to #4815 fix the problem?
>>>>>>>>
>>>>>>>> If yes, and if you see this issue as a showstopper, feel free to
>>>>>>>> commit it and drop a note to #4815.
>>>>>>>> (I am afk until tomorrow)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
>>>>>>>>
>>>>>>>> Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>>>>>>
>>>>>>>> Just an FYI that my trunk version (r32355) does not work at all
>>>>>>>> anymore if I do not include "--mca coll ^ml". Here is a stack trace
>>>>>>>> from the ibm/pt2pt/send test running on a single node.
>>>>>>>>
>>>>>>>> (gdb) where
>>>>>>>> #0  0x00007f6c0d1321d0 in ?? ()
>>>>>>>> #1  <signal handler called>
>>>>>>>> #2  0x00007f6c183abd52 in orte_util_compare_name_fields (fields=15 '\017',
>>>>>>>>     name1=0x192350001, name2=0xbaf76c) at ../../orte/util/name_fns.c:522
>>>>>>>> #3  0x00007f6c0bea17be in bcol_basesmuma_smcm_allgather_connection
>>>>>>>>     (sm_bcol_module=0x7f6bf3b68040, module=0xb3d200, peer_list=0x7f6c0c0a6748,
>>>>>>>>     back_files=0x7f6bf3ffd6c8, comm=0x6037a0, input=...,
>>>>>>>>     base_fname=0x7f6c0bea2606 "sm_payload_mem_", map_all=false)
>>>>>>>>     at ../../../../../ompi/mca/bcol/basesmuma/bcol_basesmuma_smcm.c:237
>>>>>>>> #4  0x00007f6c0be98307 in bcol_basesmuma_bank_init_opti
>>>>>>>>     (payload_block=0xbc0f60, data_offset=64, bcol_module=0x7f6bf3b68040,
>>>>>>>>     reg_data=0xba28c0)
>>>>>>>>     at ../../../../../ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:302
>>>>>>>> #5  0x00007f6c0cced386 in mca_coll_ml_register_bcols (ml_module=0xba5c40)
>>>>>>>>     at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:510
>>>>>>>> #6  0x00007f6c0cced68f in ml_module_memory_initialization (ml_module=0xba5c40)
>>>>>>>>     at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:558
>>>>>>>> #7  0x00007f6c0ccf06b1 in ml_discover_hierarchy (ml_module=0xba5c40)
>>>>>>>>     at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:1539
>>>>>>>> #8  0x00007f6c0ccf4e0b in mca_coll_ml_comm_query (comm=0x6037a0,
>>>>>>>>     priority=0x7fffe7991b58)
>>>>>>>>     at ../../../../../ompi/mca/coll/ml/coll_ml_module.c:2963
>>>>>>>> #9  0x00007f6c18cc5b09 in query_2_0_0 (component=0x7f6c0cf50940,
>>>>>>>>     comm=0x6037a0, priority=0x7fffe7991b58, module=0x7fffe7991b90)
>>>>>>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:372
>>>>>>>> #10 0x00007f6c18cc5ac8 in query (component=0x7f6c0cf50940,
>>>>>>>>     comm=0x6037a0, priority=0x7fffe7991b58, module=0x7fffe7991b90)
>>>>>>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:355
>>>>>>>> #11 0x00007f6c18cc59d2 in check_one_component (comm=0x6037a0,
>>>>>>>>     component=0x7f6c0cf50940, module=0x7fffe7991b90)
>>>>>>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:317
>>>>>>>> #12 0x00007f6c18cc5818 in check_components (components=0x7f6c18f46ef0,
>>>>>>>>     comm=0x6037a0) at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:281
>>>>>>>> #13 0x00007f6c18cbe3c9 in mca_coll_base_comm_select (comm=0x6037a0)
>>>>>>>>     at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:117
>>>>>>>> #14 0x00007f6c18c52301 in ompi_mpi_init (argc=1, argv=0x7fffe79924c8,
>>>>>>>>     requested=0, provided=0x7fffe79922e8) at ../../ompi/runtime/ompi_mpi_init.c:918
>>>>>>>> #15 0x00007f6c18c86e92 in PMPI_Init (argc=0x7fffe799234c,
>>>>>>>>     argv=0x7fffe7992340) at pinit.c:84
>>>>>>>> #16 0x0000000000401056 in main (argc=1, argv=0x7fffe79924c8) at send.c:32
>>>>>>>> (gdb) up
>>>>>>>> #1  <signal handler called>
>>>>>>>> (gdb) up
>>>>>>>> #2  0x00007f6c183abd52 in orte_util_compare_name_fields (fields=15 '\017',
>>>>>>>>     name1=0x192350001, name2=0xbaf76c) at ../../orte/util/name_fns.c:522
>>>>>>>> 522         if (name1->jobid < name2->jobid) {
>>>>>>>> (gdb) print name1
>>>>>>>> $1 = (const orte_process_name_t *) 0x192350001
>>>>>>>> (gdb) print *name1
>>>>>>>> Cannot access memory at address 0x192350001
>>>>>>>> (gdb) print name2
>>>>>>>> $2 = (const orte_process_name_t *) 0xbaf76c
>>>>>>>> (gdb) print *name2
>>>>>>>> $3 = {jobid = 2452946945, vpid = 1}
>>>>>>>> (gdb)
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Gilles
>>>>>>>>> Gouaillardet
>>>>>>>>> Sent: Wednesday, July 30, 2014 2:16 AM
>>>>>>>>> To: Open MPI Developers
>>>>>>>>> Subject: Re: [OMPI devel] trunk compilation errors in jenkins
>>>>>>>>>
>>>>>>>>> George,
>>>>>>>>>
>>>>>>>>> #4815 is indirectly related to the move:
>>>>>>>>> in bcol/basesmuma, we used to compare ompi_process_name_t, and now
>>>>>>>>> we (try to) compare an ompi_process_name_t and an opal_process_name_t
>>>>>>>>> (which causes a glorious SIGSEGV).
>>>>>>>>> i proposed a temporary patch which is both broken and inelegant;
>>>>>>>>> could you please advise a correct solution?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 2014/07/27 7:37, George Bosilca wrote:
>>>>>>>>>> If you have any issue with the move, I'll be happy to help and/or
>>>>>>>>>> support you on your last move toward a completely generic BTL. To
>>>>>>>>>> facilitate your work I exposed a minimalistic set of OMPI
>>>>>>>>>> information at the OPAL level. Take a look at opal/util/proc.h for
>>>>>>>>>> more info, but please try not to expose more.
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2014/07/15348.php