Re: [OMPI devel] Remote key sizes
On 11/8/11 5:25 PM, "George Bosilca" wrote:

>2. one sided: A quick look in the OSC seems to indicate there is some
>special handling to be done in the RDMA one. Look at
>ompi_osc_rdma_sendreq_t in osc_rdma_sendreq.h, it is using a trick to
>store the remote segments. First, the mca_btl_base_segment_t are stored
>at the end of the structure, in order to allow for dynamic allocation.
>Second, OSC doesn't seem to manipulate pointers to
>mca_btl_base_segment_t, but the content itself. I didn't go too deep
>here, but I think particular attention should be paid to OSC.

I don't entirely remember what I was doing when I wrote that code :). The OSC only does puts/gets from the initiator to a single segment on the target, so the component contains an array of segments, one per peer. I only do RDMA when the source is contiguous, so the one in the sendreq is the segment, not a malloc trick.

I'm planning on rewriting the RDMA one-sided component to implement the MPI 3 semantics. I think we can make it a whole lot cleaner than the current implementation. Which means that if we come up with some rational semantics for dealing with segments, I can make it work. If we can get them implemented before January, even better.

Brian

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories
Re: [OMPI devel] Remote key sizes
On Nov 8, 2011, at 10:36 , Nathan T. Hjelm wrote:

> On Tue, 8 Nov 2011 06:36:03 -0800, Rolf vandeVaart wrote:
>>> george.
>>>
>>> PS: Regarding the hand-copy instead of the memcpy, we tried to avoid using
>>> memcpy in performance critical codes, especially when we know the size of
>>> the data and the alignment. This relieves the compiler of adding ugly intrinsics,
>>> allowing it to nicely pipeline the load/stores. Anyway, with both approaches
>>> you will copy more data than needed for all BTLs except uGNI.
>>
>> I was looking at a case in a BTL I was working on where I actually need 64
>> bytes (yes, bytes) as the remote key size as opposed to the current 16
>> bytes (128 bits).
>> Not sure how I can handle that yet. (I assume configure is my friend, but
>> even in that case, all headers will need to carry around the extra data.)
>
> I have been thinking about this a little bit. What I think should be done
> (and I am sure George will disagree) is to allow BTLs to define how long a

Well, I'm really sorry to disappoint you …

> segment is. The PML would then just memcpy the segments into the send
> buffer (instead of copying each member).

The only valid reason I can find now for having the seg_key as it is defined today is code simplicity. Read below and you will understand. Otherwise I completely agree with you: the seg_key is something belonging to the BTLs, and all knowledge about it should be limited to the BTLs (i.e., the PML should just move it around).

The solution you propose makes sense… However, there are a few things that I think make it more challenging to implement than it looks.

1. endianness: Apparently the BTL is already responsible for storing the key in network order, as no translation is done on the key in the PMLs. As I don't think any of them do, I will assume this is already [somehow] taken care of.

2. one sided: A quick look in the OSC seems to indicate there is some special handling to be done in the RDMA one. Look at ompi_osc_rdma_sendreq_t in osc_rdma_sendreq.h, it is using a trick to store the remote segments. First, the mca_btl_base_segment_t are stored at the end of the structure, in order to allow for dynamic allocation. Second, OSC doesn't seem to manipulate pointers to mca_btl_base_segment_t, but the content itself. I didn't go too deep here, but I think particular attention should be paid to OSC.

3. PML. In addition to seg_len we use the seg_addr field extensively all over the code base, so it should be exposed in the mca_btl_base_segment_t as well.

4. How do we keep the capability of dealing with multiple mca_btl_base_segment_t? Just imagine what the macro MCA_PML_OB1_COMPUTE_SEGMENT_LENGTH will look like…

Everything else should be quite trivial ;)

george.

> For example mca_btl_base_segment_t would become:
>
> struct mca_btl_base_segment_t {
>     size_t seg_len;
> };
>
> since the pml needs the segment size (it does not need anything else).
>
> and then each btl would define its own segment like:
> struct mca_btl_ugni_segment_t {
>     struct mca_btl_base_segment_t base;
>     gni_mem_handle_t seg_key;
> };
>
> and we would add:
> size_t btl_segment_len;
>
> to the mca_btl_base_module_t or the base frag so the pml knows how much it
> needs to copy.
>
> This design would address George's criticism of the length of the seg_key
> and also allow BTLs to do what they need to. It would require a memcpy but
> I disagree this would slow the critical path. Even if it does it would be
> relatively minor (I think) and the flexibility is worth more in the long
> run.
>
> -Nathan
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
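To make Nathan's proposal and George's point 4 a little more concrete, here is a minimal sketch in C. It assumes the base segment carries only seg_len, as proposed in the thread. Everything prefixed with example_ (the 64-byte key segment, the module struct, and both helper functions) is hypothetical and is not Open MPI code. The sketch only shows that a PML-like layer could copy opaque segments and still sum their lengths by striding over btl_segment_len bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Base segment as proposed in the thread: only the length is visible. */
struct mca_btl_base_segment_t {
    size_t seg_len;
};

/* Hypothetical BTL-specific segment with a 64-byte key (an assumption,
 * standing in for a BTL-private key type such as gni_mem_handle_t). */
struct example_btl_segment_t {
    struct mca_btl_base_segment_t base;
    uint8_t seg_key[64];
};

/* Hypothetical stand-in for the field proposed for mca_btl_base_module_t. */
struct example_btl_module_t {
    size_t btl_segment_len;   /* size of the BTL's own segment type */
};

/* Copy `count` opaque segments into a send buffer.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t example_pack_segments(const struct example_btl_module_t *btl,
                                    const void *segments, size_t count,
                                    void *buf, size_t buf_len)
{
    assert(btl->btl_segment_len >= sizeof(struct mca_btl_base_segment_t));
    size_t total = btl->btl_segment_len * count;
    if (total > buf_len) {
        return 0;
    }
    memcpy(buf, segments, total);
    return total;
}

/* Sum seg_len over `count` segments without knowing the BTL's layout:
 * step through the array in units of btl_segment_len bytes. */
static size_t example_total_length(const struct example_btl_module_t *btl,
                                   const void *segments, size_t count)
{
    const unsigned char *p = segments;
    size_t sum = 0;
    for (size_t i = 0; i < count; ++i) {
        struct mca_btl_base_segment_t base;
        /* memcpy avoids alignment assumptions about the packed buffer. */
        memcpy(&base, p + i * btl->btl_segment_len, sizeof base);
        sum += base.seg_len;
    }
    return sum;
}
```

The stride-based loop is one possible answer to the multiple-segment question; it does not address seg_addr (point 3) or byte order (point 1).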
Re: [OMPI devel] debugger changes
Now this thread is starting to read like an episode of The Big Bang Theory.

One possible guess as to how/why MPICH has managed w/o "volatile" would be that they may pass less aggressive optimization flags to the compilers. It is then a question of which MPI implementation is supporting a choice of compilers, not a selection of debuggers.

-Paul

On 11/8/2011 3:48 PM, George Bosilca wrote:
> I will therefore propose to forever ban all compiler guys from this
> time-space, as now we have the undeniable proof that they concoct an evil
> plan against us. Otherwise, I can't explain how MPICH never had to add
> volatile to these particular variables and still support all these
> debuggers…
>
> george.

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] debugger changes
On Nov 8, 2011, at 18:32 , Ralph Castain wrote:

> That was the experience - after thrashing for quite some time, we finally
> found that the volatile qualifiers fixed the problem. Hence my request that
> people check to see if anything is broken.

I will therefore propose to forever ban all compiler guys from this time-space, as now we have the undeniable proof that they concoct an evil plan against us. Otherwise, I can't explain how MPICH never had to add volatile to these particular variables and still support all these debuggers…

george.

>> -Paul
>>
>> On 11/8/2011 2:46 PM, George Bosilca wrote:
>>> This value is not even read by the debugger. It only checks for its
>>> existence in the startup process, so I guess we're safe here as well.
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Future Technologies Group
>> HPC Research Department                   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25445
I do not recall, and from the code there is no obvious reason. However, being able to store multiple smaller members might be a good enough reason.

Btw, we don't use the key8 at all. I guess we can clean that code up to only keep key32 and key64, eventually with the count to match up the right size ;)

george.

On Nov 8, 2011, at 18:11 , Nathan T. Hjelm wrote:

> Ok, that makes sense. Is there a reason why the members were all set to be
> the same size?
>
> Maybe seg_key should be:
>
> union {
>     uint8_t key8;
>     uint16_t key16;
>     uint32_t key32;
>     uint64_t key64;
>     struct { uint64_t value[2]; } key128;
> };
>
> -Nathan
>
> On Tue, 8 Nov 2011 17:22:48 -0500, George Bosilca wrote:
>> Elements in an array are always stored in the expected [increasing] order,
>> regardless of the endianness of the architecture. Moreover, due to the
>> alignment rules, all members in a union will start at the same address.
>>
>> It turns out there is no endianness conversion on the keys, so I suppose
>> both peers have to somehow reach a consensus outside the PML.
>>
>> george.
>>
>> On Nov 8, 2011, at 08:57 , Nathan T. Hjelm wrote:
>>
>>> Sure, I can do that. My only concern is with sending between hosts of
>>> different endianness.
>>>
>>> For example, if seg_key is 128 bits wide and the key32 is 64 bits then we
>>> might run into this:
>>>
>>> Host 1: (big endian)
>>> Set seg_key.key32[0] = 0x
>>>
>>> would result in seg_key: 0x 0x 0x 0x
>>>
>>> Host 2: (little endian)
>>> Set seg_key.key32[0] = 0x1
>>>
>>> would result in seg_key: 0x 0x 0x 0x
>>>
>>> If either host were to send the other one its seg_key and try to use the
>>> key32 they would get garbage. I haven't tested this case yet but I can test
>>> on a PPE of RR later today.
>>>
>>> -Nathan
>>>
>>> On Tue, 8 Nov 2011 08:26:04 -0500, Jeff Squyres wrote:
>>>> On Nov 7, 2011, at 9:48 PM, Nathan T. Hjelm wrote:
>>>>
>>>>> In retrospect I should have done a RFC for the 3rd change with a short
>>>>> timeout. At the time (operating on little sleep) it seemed like the commits
>>>>> would have minimal impact. Please let me know if the commits have any
>>>>> negative impact.
>>>>
>>>> FWIW, I think I'd like to see a rollback of the increase of array sizes in
>>>> the seg_key union. They weren't necessary and might be slightly
>>>> misleading.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] debugger changes
On Nov 8, 2011, at 3:56 PM, Paul H. Hargrove wrote:

> In theory, might a sufficiently smart compiler and linker eliminate some
> MPIR_* variables after optimization? If that could potentially be true, then
> perhaps the volatile qualifier would prevent such a removal, which would
> break the existence check(s) by the debugger? Just a thought.

That was the experience - after thrashing for quite some time, we finally found that the volatile qualifiers fixed the problem. Hence my request that people check to see if anything is broken.

> -Paul
>
> On 11/8/2011 2:46 PM, George Bosilca wrote:
>> This value is not even read by the debugger. It only checks for its
>> existence in the startup process, so I guess we're safe here as well.
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] Remote key sizes
That makes sense to me.

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan T. Hjelm
Sent: Tuesday, November 08, 2011 8:36 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Remote key sizes

On Tue, 8 Nov 2011 06:36:03 -0800, Rolf vandeVaart wrote:
>> george.
>>
>> PS: Regarding the hand-copy instead of the memcpy, we tried to avoid using
>> memcpy in performance critical codes, especially when we know the size of
>> the data and the alignment. This relieves the compiler of adding ugly intrinsics,
>> allowing it to nicely pipeline the load/stores. Anyway, with both approaches
>> you will copy more data than needed for all BTLs except uGNI.
>
> I was looking at a case in a BTL I was working on where I actually need 64
> bytes (yes, bytes) as the remote key size as opposed to the current 16
> bytes (128 bits).
> Not sure how I can handle that yet. (I assume configure is my friend, but
> even in that case, all headers will need to carry around the extra data.)

I have been thinking about this a little bit. What I think should be done (and I am sure George will disagree) is to allow BTLs to define how long a segment is. The PML would then just memcpy the segments into the send buffer (instead of copying each member).

For example mca_btl_base_segment_t would become:

struct mca_btl_base_segment_t {
    size_t seg_len;
};

since the pml needs the segment size (it does not need anything else).

and then each btl would define its own segment like:

struct mca_btl_ugni_segment_t {
    struct mca_btl_base_segment_t base;
    gni_mem_handle_t seg_key;
};

and we would add:

size_t btl_segment_len;

to the mca_btl_base_module_t or the base frag so the pml knows how much it needs to copy.

This design would address George's criticism of the length of the seg_key and also allow BTLs to do what they need to. It would require a memcpy but I disagree this would slow the critical path. Even if it does it would be relatively minor (I think) and the flexibility is worth more in the long run.

-Nathan
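The disagreement above is between copying a key member by member, where size and alignment are known at compile time, and calling memcpy with a length known only at run time. The following is a minimal sketch of the two styles, assuming a 128-bit key modeled as two 64-bit words. The type and function names are hypothetical and the code is not taken from Open MPI. It shows only that both approaches produce the same bytes; it makes no claim about which is faster, since that depends on the compiler and the target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit key, modeled as two 64-bit words. */
typedef struct {
    uint64_t key64[2];
} example_key128_t;

/* Hand copy: the size and alignment are fixed at compile time, so the
 * compiler sees two plain 64-bit loads and stores. */
static void example_copy_by_hand(example_key128_t *dst,
                                 const example_key128_t *src)
{
    dst->key64[0] = src->key64[0];
    dst->key64[1] = src->key64[1];
}

/* memcpy with a runtime length, which is what a per-BTL segment size
 * (btl_segment_len in the proposal) would require. */
static void example_copy_memcpy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
```

With a constant length many compilers can expand memcpy inline; with a length read from a module structure at run time that is less likely, which seems to be the cost being weighed in the thread.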
Re: [OMPI devel] debugger changes
On Nov 8, 2011, at 17:56 , Paul H. Hargrove wrote:

> In theory, might a sufficiently smart compiler and linker eliminate some
> MPIR_* variables after optimization?

Even if a compiler can optimize out symbols from an application, I doubt it is allowed to apply the same optimization to libraries. As our MPIR_ symbols are defined as externally visible in libopen-rte.so (and some in libmpi.so), I guess we're safe. However, this might be an issue when we compile statically …

It is not an absolute proof, but I quickly checked with a static build and the MPIR_* symbols are still there with both gcc and icc.

> If that could potentially be true, then perhaps the volatile qualifier would
> prevent such a removal, which would break the existence check(s) by the
> debugger? Just a thought.

If we really want to have a clear answer to this, I guess we should ask a hard-core compiler guru about it …

george.

> -Paul
>
> On 11/8/2011 2:46 PM, George Bosilca wrote:
>> This value is not even read by the debugger. It only checks for its
>> existence in the startup process, so I guess we're safe here as well.
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Future Technologies Group
> HPC Research Department                   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25445
Ok, that makes sense. Is there a reason why the members were all set to be the same size?

Maybe seg_key should be:

union {
    uint8_t key8;
    uint16_t key16;
    uint32_t key32;
    uint64_t key64;
    struct { uint64_t value[2]; } key128;
};

-Nathan

On Tue, 8 Nov 2011 17:22:48 -0500, George Bosilca wrote:
> Elements in an array are always stored in the expected [increasing] order,
> regardless of the endianness of the architecture. Moreover, due to the
> alignment rules, all members in a union will start at the same address.
>
> It turns out there is no endianness conversion on the keys, so I suppose
> both peers have to somehow reach a consensus outside the PML.
>
> george.
>
> On Nov 8, 2011, at 08:57 , Nathan T. Hjelm wrote:
>
>> Sure, I can do that. My only concern is with sending between hosts of
>> different endianness.
>>
>> For example, if seg_key is 128 bits wide and the key32 is 64 bits then we
>> might run into this:
>>
>> Host 1: (big endian)
>> Set seg_key.key32[0] = 0x
>>
>> would result in seg_key: 0x 0x 0x 0x
>>
>> Host 2: (little endian)
>> Set seg_key.key32[0] = 0x1
>>
>> would result in seg_key: 0x 0x 0x 0x
>>
>> If either host were to send the other one its seg_key and try to use the
>> key32 they would get garbage. I haven't tested this case yet but I can test
>> on a PPE of RR later today.
>>
>> -Nathan
>>
>> On Tue, 8 Nov 2011 08:26:04 -0500, Jeff Squyres wrote:
>>> On Nov 7, 2011, at 9:48 PM, Nathan T. Hjelm wrote:
>>>
>>>> In retrospect I should have done a RFC for the 3rd change with a short
>>>> timeout. At the time (operating on little sleep) it seemed like the commits
>>>> would have minimal impact. Please let me know if the commits have any
>>>> negative impact.
>>>
>>> FWIW, I think I'd like to see a rollback of the increase of array sizes in
>>> the seg_key union. They weren't necessary and might be slightly
>>> misleading.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] debugger changes
In theory, might a sufficiently smart compiler and linker eliminate some MPIR_* variables after optimization? If that could potentially be true, then perhaps the volatile qualifier would prevent such a removal, which would break the existence check(s) by the debugger? Just a thought.

-Paul

On 11/8/2011 2:46 PM, George Bosilca wrote:
> This value is not even read by the debugger. It only checks for its
> existence in the startup process, so I guess we're safe here as well.

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
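The question mixes two separate concerns, and a small sketch may help separate them. As far as the C language is concerned, volatile constrains how accesses to an object are performed; it does not by itself say whether an unreferenced symbol survives linking, which depends on linkage and symbol visibility. The sketch below assumes a flag that some outside agent, such as a debugger, writes into the process. All names are hypothetical and are not the real MPIR variables.

```c
#include <assert.h>

/* Hypothetical flag that an attached debugger would set from outside the
 * program. volatile forces every read in the loop below to go to memory. */
volatile int example_debug_gate = 0;

/* Hypothetical marker that a tool only looks up by name. Nothing in the
 * program reads it, so volatile would change no generated access; external
 * linkage is what keeps the symbol visible in a shared library. */
int example_marker = 0;

/* Spin until the gate is set or `max_spins` polls have elapsed.
 * Returns 1 if the gate was observed set, 0 on timeout. */
static int example_wait_for_debugger(long max_spins)
{
    assert(max_spins >= 0);
    long spins = 0;
    while (example_debug_gate == 0 && spins < max_spins) {
        ++spins;
    }
    return example_debug_gate != 0;
}
```

Without volatile on the gate, an optimizing compiler could legally read it once and hoist the test out of the loop, so a debugger's write might never be noticed. That is the kind of problem the qualifier addresses; whether it also explains the breakage reported in the thread is not something this sketch can establish.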
Re: [OMPI devel] debugger changes
I guess people should check the commit before … No way the volatile will do any good here:

-ORTE_DECLSPEC extern volatile char MPIR_executable_path[MPIR_MAX_PATH_LENGTH];
-ORTE_DECLSPEC extern volatile char MPIR_server_arguments[MPIR_MAX_ARG_LENGTH];
+ORTE_DECLSPEC extern char MPIR_executable_path[MPIR_MAX_PATH_LENGTH];
+ORTE_DECLSPEC extern char MPIR_server_arguments[MPIR_MAX_ARG_LENGTH];

This value is not even read by the debugger. It only checks for its existence in the startup process, so I guess we're safe here as well.

-volatile int MPIR_i_am_starter = 0;
+int MPIR_i_am_starter = 0;

george.

On Nov 8, 2011, at 17:43 , Ashley Pittman wrote:

> I think the volatiles are there to ensure the compiler doesn't optimise away
> reads or function calls which has been a problem with this interface in the
> past.
>
> On 8 Nov 2011, at 22:18, George Bosilca wrote:
>
>> MPIR_Breakpoint, as the name indicates, is just a breakpoint used by the
>> startup process or the MPI application to signal changes to the debugger. No
>> return value, nothing more than a breakpoint.
>>
>> I wonder how the volatile got there; there is no such requirement on
>> variables that cannot be changed during execution.
>>
>> george.
>>
>> On Nov 8, 2011, at 08:36 , Jeff Squyres wrote:
>>
>>> I think the only possible controversial change in this commit is changing
>>> MPIR_Breakpoint() to return (void) instead of (void*). Oddly, I see that
>>> MPICH2 has 2 different prototypes for MPIR_Breakpoint -- one returns
>>> (void*), another returns (int). Assuming that MPICH2 works fine with the
>>> debuggers, this suggests that the return is ignored by the tools -- as it
>>> should be.
>>>
>>> I didn't check the volatile removals; I'm assuming that George got them
>>> right. :-)
>>>
>>> I'll bet that this change does not cause any problems, but it might be
>>> worth checking with the big 3+1:
>>>
>>> - DDT
>>> - Totalview
>>> - padb
>>> - stat
>>>
>>> On Nov 7, 2011, at 8:24 PM, bosi...@osl.iu.edu wrote:
>>>
>>>> Author: bosilca
>>>> Date: 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011)
>>>> New Revision: 25456
>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25456
>>>>
>>>> Log:
>>>> Put the interface of our MPIR support in sync with the document accepted by
>>>> the MPI Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf).
>>>>
>>>> Text files modified:
>>>>    trunk/ompi/debuggers/debuggers.h                  |28 ++--
>>>>    trunk/orte/mca/debugger/base/base.h               |10 +-
>>>>    trunk/orte/mca/debugger/base/debugger_base_fns.c  | 6 +++---
>>>>    trunk/orte/mca/debugger/base/debugger_base_open.c | 6 +++---
>>>>    4 files changed, 25 insertions(+), 25 deletions(-)
>>>>
>>>> Modified: trunk/ompi/debuggers/debuggers.h
>>>> ==
>>>> --- trunk/ompi/debuggers/debuggers.h (original)
>>>> +++ trunk/ompi/debuggers/debuggers.h 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011)
>>>> @@ -31,20 +31,20 @@
>>>>
>>>>  BEGIN_C_DECLS
>>>>
>>>> -/**
>>>> - * Wait for a debugger if asked.
>>>> - */
>>>> -extern void ompi_wait_for_debugger(void);
>>>> -
>>>> -/**
>>>> - * Notify a debugger that we're about to abort
>>>> - */
>>>> -extern void ompi_debugger_notify_abort(char *string);
>>>> -
>>>> -/**
>>>> - * Breakpoint function for parallel debuggers.
>>>> - */
>>>> -ORTE_DECLSPEC extern void *MPIR_Breakpoint(void);
>>>> +/**
>>>> + * Wait for a debugger if asked.
>>>> + */
>>>> +extern void ompi_wait_for_debugger(void);
>>>> +
>>>> +/**
>>>> + * Notify a debugger that we're about to abort
>>>> + */
>>>> +extern void ompi_debugger_notify_abort(char *string);
>>>> +
>>>> +/**
>>>> + * Breakpoint function for parallel debuggers.
>>>> + */
>>>> +ORTE_DECLSPEC extern void MPIR_Breakpoint(void);
>>>>
>>>>  END_C_DECLS
>>>>
>>>> Modified: trunk/orte/mca/debugger/base/base.h
>>>> ==
>>>> --- trunk/orte/mca/debugger/base/base.h (original)
>>>> +++ trunk/orte/mca/debugger/base/base.h 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011)
>>>> @@ -61,18 +61,18 @@
>>>>  ORTE_DECLSPEC extern int MPIR_proctable_size;
>>>>  ORTE_DECLSPEC extern volatile int MPIR_being_debugged;
>>>>  ORTE_DECLSPEC extern volatile int MPIR_debug_state;
>>>> -ORTE_DECLSPEC extern volatile int MPIR_i_am_starter;
>>>> +ORTE_DECLSPEC extern int MPIR_i_am_starter;
>>>>  ORTE_DECLSPEC extern int MPIR_partial_attach_ok;
>>>> -ORTE_DECLSPEC extern volatile char MPIR_executable_path[MPIR_MAX_PATH_LENGTH];
>>>> -ORTE_DECLSPEC extern volatile char MPIR_server_arguments[MPIR_MAX_ARG_LENGTH];
>>>> +ORTE_DECLSPEC extern char
Re: [OMPI devel] debugger changes
I think the volatiles are there to ensure the compiler doesn't optimise away reads or function calls which has been a problem with this interface in the past.

On 8 Nov 2011, at 22:18, George Bosilca wrote:

> MPIR_Breakpoint, as the name indicates, is just a breakpoint used by the
> startup process or the MPI application to signal changes to the debugger. No
> return value, nothing more than a breakpoint.
>
> I wonder how the volatile got there; there is no such requirement on
> variables that cannot be changed during execution.
>
> george.
>
> On Nov 8, 2011, at 08:36 , Jeff Squyres wrote:
>
>> I think the only possible controversial change in this commit is changing
>> MPIR_Breakpoint() to return (void) instead of (void*). Oddly, I see that
>> MPICH2 has 2 different prototypes for MPIR_Breakpoint -- one returns
>> (void*), another returns (int). Assuming that MPICH2 works fine with the
>> debuggers, this suggests that the return is ignored by the tools -- as it
>> should be.
>>
>> I didn't check the volatile removals; I'm assuming that George got them
>> right. :-)
>>
>> I'll bet that this change does not cause any problems, but it might be worth
>> checking with the big 3+1:
>>
>> - DDT
>> - Totalview
>> - padb
>> - stat
>>
>> On Nov 7, 2011, at 8:24 PM, bosi...@osl.iu.edu wrote:
>>
>>> Author: bosilca
>>> Date: 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011)
>>> New Revision: 25456
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25456
>>>
>>> Log:
>>> Put the interface of our MPIR support in sync with the document accepted by
>>> the MPI Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf).
>>>
>>> Text files modified:
>>>    trunk/ompi/debuggers/debuggers.h                  |28 ++--
>>>    trunk/orte/mca/debugger/base/base.h               |10 +-
>>>    trunk/orte/mca/debugger/base/debugger_base_fns.c  | 6 +++---
>>>    trunk/orte/mca/debugger/base/debugger_base_open.c | 6 +++---
>>>    4 files changed, 25 insertions(+), 25 deletions(-)
>>>
>>> Modified: trunk/ompi/debuggers/debuggers.h
>>> ==
>>> --- trunk/ompi/debuggers/debuggers.h (original)
>>> +++ trunk/ompi/debuggers/debuggers.h 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011)
>>> @@ -31,20 +31,20 @@
>>>
>>>  BEGIN_C_DECLS
>>>
>>> -/**
>>> - * Wait for a debugger if asked.
>>> - */
>>> -extern void ompi_wait_for_debugger(void);
>>> -
>>> -/**
>>> - * Notify a debugger that we're about to abort
>>> - */
>>> -extern void ompi_debugger_notify_abort(char *string);
>>> -
>>> -/**
>>> - * Breakpoint function for parallel debuggers.
>>> - */
>>> -ORTE_DECLSPEC extern void *MPIR_Breakpoint(void);
>>> +/**
>>> + * Wait for a debugger if asked.
>>> + */
>>> +extern void ompi_wait_for_debugger(void);
>>> +
>>> +/**
>>> + * Notify a debugger that we're about to abort
>>> + */
>>> +extern void ompi_debugger_notify_abort(char *string);
>>> +
>>> +/**
>>> + * Breakpoint function for parallel debuggers.
>>> + */
>>> +ORTE_DECLSPEC extern void MPIR_Breakpoint(void);
>>>
>>>  END_C_DECLS
>>>
>>> Modified: trunk/orte/mca/debugger/base/base.h
>>> ==
>>> --- trunk/orte/mca/debugger/base/base.h (original)
>>> +++ trunk/orte/mca/debugger/base/base.h 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011)
>>> @@ -61,18 +61,18 @@
>>>  ORTE_DECLSPEC extern int MPIR_proctable_size;
>>>  ORTE_DECLSPEC extern volatile int MPIR_being_debugged;
>>>  ORTE_DECLSPEC extern volatile int MPIR_debug_state;
>>> -ORTE_DECLSPEC extern volatile int MPIR_i_am_starter;
>>> +ORTE_DECLSPEC extern int MPIR_i_am_starter;
>>>  ORTE_DECLSPEC extern int MPIR_partial_attach_ok;
>>> -ORTE_DECLSPEC extern volatile char MPIR_executable_path[MPIR_MAX_PATH_LENGTH];
>>> -ORTE_DECLSPEC extern volatile char MPIR_server_arguments[MPIR_MAX_ARG_LENGTH];
>>> +ORTE_DECLSPEC extern char MPIR_executable_path[MPIR_MAX_PATH_LENGTH];
>>> +ORTE_DECLSPEC extern char MPIR_server_arguments[MPIR_MAX_ARG_LENGTH];
>>>  ORTE_DECLSPEC extern volatile int MPIR_forward_output;
>>>  ORTE_DECLSPEC extern volatile int MPIR_forward_comm;
>>>  ORTE_DECLSPEC extern char MPIR_attach_fifo[MPIR_MAX_PATH_LENGTH];
>>>  ORTE_DECLSPEC extern int MPIR_force_to_main;
>>>
>>> -typedef void* (*orte_debugger_breakpoint_fn_t)(void);
>>> +typedef void (*orte_debugger_breakpoint_fn_t)(void);
>>>
>>> -ORTE_DECLSPEC void* MPIR_Breakpoint(void);
>>> +ORTE_DECLSPEC void MPIR_Breakpoint(void);
>>>
>>>  /* --- end MPICH/TotalView std debugger interface definitions */
>>>
>>> Modified: trunk/orte/mca/debugger/base/debugger_base_fns.c
>>>
Re: [OMPI devel] make check fails for Intel 2011.6.233 (OpenMPI 1.4.3)
Larry,

Thanks for following up with us on this. I think your patch is cleaner than what we currently have in the trunk, so I went ahead and pushed it to the trunk (25461). I will request a push to 1.5 and 1.4 as well.

Regards,
george.

On Nov 8, 2011, at 13:57 , Larry Baker wrote:

> The good news is that the issue reported in R25290 is fixed in the latest
> Intel compilers release (2011.7.256). The bad news is that both the
> 2011.6.233 and 2011.7.256 releases identify themselves as V12.1.0 from the
> command line. (I reported this bug to Intel already.) They can only be
> reliably distinguished using the predefined __INTEL_COMPILER_BUILD_DATE
> macro. I verified that the build dates for all three compilers we have --
> Linux, Mac OS X, and Windows -- are the same.
>
> I developed a more targeted patch (attached) for OpenMPI 1.4.3
> opal/mca/memory/ptmalloc2/malloc.c which disables vectorization for
> _int_malloc() only if an Intel compiler with the 2011.6.233 release build
> date is found (__INTEL_COMPILER_BUILD_DATE == 20110811). This patch could
> presumably make its way into all the copies of
> opal/mca/memory/ptmalloc2/malloc.c in the various versions of OpenMPI that
> are still being maintained.
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
>
> On 17 Oct 2011, at 8:18 PM, George Bosilca wrote:
>
>> Larry,
>>
>> Sorry for not updating this thread. The issue was identified and fixed by
>> Rainer in r25290 (https://svn.open-mpi.org/trac/ompi/changeset/25290).
>> Please read the comments and the linked thread on the Intel forum for more
>> info about it.
>>
>> I couldn't find a trace of this being fixed in the 1.4 series, so I would
>> wait on upgrading until this issue gets resolved.
>>
>> Thanks,
>> george.
>>
>> On Oct 17, 2011, at 23:00 , Larry Baker wrote:
>>
>>> George,
>>>
>>> I have not had time to look over the 1.4.3 make check failure for Intel
>>> 2011.6.233 compilers. Have you?
>>>
>>> I had planned to get 1.4.3 compiled on all six of our compilers using the
>>> latest compiler releases. I was putting off upgrading to 1.4.4 or 1.5.x
>>> until after that to minimize the number of things that could go wrong. Do
>>> you recommend otherwise?
>>>
>>> Larry Baker
>>> US Geological Survey
>>> 650-329-5608
>>> ba...@usgs.gov
>>>
>>> On 7 Oct 2011, at 6:46 PM, George Bosilca wrote:
>>>
>>>> The may_alias attribute was part of a forward-looking attribute checking,
>>>> at a time when few compilers supported them. This explains why they are
>>>> not widely used in the library itself. Moreover, as they do not affect
>>>> the compilation itself (as your test highlights, this is not the issue
>>>> with the icc 2011.6.233 compiler), there is no urgency to remove the
>>>> may_alias support.
>>>>
>>>> I just got that particular version of the compiler installed on one of
>>>> our machines. I'll give it a try over the weekend.
>>>>
>>>> george.
>>>>
>>>> On Oct 7, 2011, at 20:21 , Larry Baker wrote:
>>>>
>>>>> The test for the __may_alias__ attribute uses the following short code
>>>>> snippet:
>>>>>
>>>>>> int * p_value __attribute__ ((__may_alias__));
>>>>>> int
>>>>>> main ()
>>>>>> {
>>>>>>
>>>>>> ;
>>>>>> return 0;
>>>>>> }
>>>>>
>>>>> Indeed, for Intel 2011 compilers prior to 2011.6.233, this results in a
>>>>> warning:
>>>>>
>>>>>> [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.5.220
>>>>>> [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c
>>>>>> may_alias_test.c(123): warning #1292: attribute "__may_alias__" ignored
>>>>>> int * p_value __attribute__ ((__may_alias__));
>>>>>> ^
>>>>>>
>>>>>> [root@hydra openmpi-1.4.3]# module unload compilers/intel/2011.5.220
>>>>>
>>>>>> [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.6.233
>>>>>> [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c
>>>>>
>>>>> I modified ./configure to force
>>>>>
>>>>>> ompi_cv___attribute__may_alias=0
>>>>>
>>>>> Then I compiled and tested the library. Unfortunately, the results were
>>>>> exactly the same:
>>>>>
>>>>>> make check-TESTS
>>>>>> make[3]: Entering directory
>>>>>> `/state/partition1/root/src/openmpi-1.4.3/test/datatype'
>>>>>> /bin/sh: line 4: 26326 Segmentation fault ${dir}$tst
>>>>>> FAIL: checksum
>>>>>> /bin/sh: line 4: 26359 Segmentation fault ${dir}$tst
>>>>>> FAIL: position
>>>>>>
>>>>>> 2 of 2 tests failed
>>>>>> Please report to http://www.open-mpi.org/community/help/
>>>>>
>>>>> I could not find any use of the may_alias attribute, other than in a
>>>>> #define in opal/include/opal_config_bottom.h. Is
>>>>> OMPI_HAVE_ATTRIBUTE_MAY_ALIAS just cruft that can be removed?
>>>>>
>>>>> Larry Baker
>>>>> US Geological Survey
>>>>> 650-329-5608
>>>>> ba...@usgs.gov
>>>>>
>>>>> On 7 Oct 2011, at 11:08 AM,
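Larry's patch itself is an attachment that is not reproduced in this archive, so the following is only a sketch of the detection step he describes: keying on __INTEL_COMPILER_BUILD_DATE because the version number alone cannot tell 2011.6.233 from 2011.7.256. The macro EXAMPLE_ICC_2011_6_233 and the helper function are hypothetical names. The value 20110811 is the build date quoted in his message. How the real patch then disables vectorization for _int_malloc() is not shown here.

```c
/* 20110811 is the build date the thread reports for the 2011.6.233
 * release. On any other compiler the guard evaluates to 0. */
#if defined(__INTEL_COMPILER) && defined(__INTEL_COMPILER_BUILD_DATE)
#  if __INTEL_COMPILER_BUILD_DATE == 20110811
#    define EXAMPLE_ICC_2011_6_233 1
#  endif
#endif
#ifndef EXAMPLE_ICC_2011_6_233
#  define EXAMPLE_ICC_2011_6_233 0
#endif

/* Hypothetical helper: reports at run time whether this translation unit
 * was built by the affected compiler release. */
static int example_needs_icc_workaround(void)
{
    return EXAMPLE_ICC_2011_6_233;
}
```

A compiler-specific workaround would sit behind `#if EXAMPLE_ICC_2011_6_233`, so every other compiler and every other Intel release builds the unmodified code.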
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25445
Elements in an array are always stored in the expected [increasing] order, regardless of the endianness of the architecture. Moreover, due to the alignment rules, all members in a union will start at the same address.

It turns out there is no endianness conversion on the keys, so I suppose both peers have to somehow reach a consensus outside the PML.

george.

On Nov 8, 2011, at 08:57 , Nathan T. Hjelm wrote:

> Sure, I can do that. My only concern is with sending between hosts of
> different endianness.
>
> For example, if seg_key is 128 bits wide and the key32 is 64 bits then we
> might run into this:
>
> Host 1: (big endian)
> Set seg_key.key32[0] = 0x
>
> would result in seg_key: 0x 0x 0x 0x
>
> Host 2: (little endian)
> Set seg_key.key32[0] = 0x1
>
> would result in seg_key: 0x 0x 0x 0x
>
> If either host were to send the other one its seg_key and try to use the
> key32 they would get garbage. I haven't tested this case yet but I can test
> on a PPE of RR later today.
>
> -Nathan
>
> On Tue, 8 Nov 2011 08:26:04 -0500, Jeff Squyres wrote:
>> On Nov 7, 2011, at 9:48 PM, Nathan T. Hjelm wrote:
>>
>>> In retrospect I should have done a RFC for the 3rd change with a short
>>> timeout. At the time (operating on little sleep) it seemed like the commits
>>> would have minimal impact. Please let me know if the commits have any
>>> negative impact.
>>
>> FWIW, I think I'd like to see a rollback of the increase of array sizes in
>> the seg_key union. They weren't necessary and might be slightly
>> misleading.
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
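The two facts George states, and the part that still differs between hosts, can be checked with a small C sketch. The union below is an illustration modeled on the discussion and is not the actual Open MPI seg_key definition. The helper functions are hypothetical. The sketch shows that union members share a starting address, that the byte order inside one element depends on the host, and one possible way a BTL could store a key in network (big-endian) order so that both peers agree.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative union modeled on the thread; not the real seg_key. */
union example_seg_key {
    uint32_t key32[4];
    uint64_t key64[2];
    uint8_t  bytes[16];
};

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int example_is_little_endian(void)
{
    uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Store a 32-bit value at the start of the key in network (big-endian)
 * byte order, independent of the host's native order. */
static void example_store_key32_be(union example_seg_key *k, uint32_t v)
{
    k->bytes[0] = (uint8_t)(v >> 24);
    k->bytes[1] = (uint8_t)(v >> 16);
    k->bytes[2] = (uint8_t)(v >> 8);
    k->bytes[3] = (uint8_t)v;
}

/* Load the value back; gives the same result on either kind of host. */
static uint32_t example_load_key32_be(const union example_seg_key *k)
{
    return ((uint32_t)k->bytes[0] << 24) |
           ((uint32_t)k->bytes[1] << 16) |
           ((uint32_t)k->bytes[2] << 8)  |
            (uint32_t)k->bytes[3];
}
```

Assigning key32[0] natively puts the low-order byte first on a little-endian host and last on a big-endian one, which is the mismatch Nathan is worried about. Storing through a fixed byte order, as in the last two functions, is one way the peers could "reach a consensus outside the PML".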
Re: [OMPI devel] debugger changes
MPIR_Breakpoint, as the name indicates, it is just a breakpoint used by the startup process or the MPI application to signal changes to the debugger. No return value, nothing more than a breakpoint. I wonder how the volatile got there, there is no such requirement on variables that cannot be changed during execution. george. On Nov 8, 2011, at 08:36 , Jeff Squyres wrote: > I think the only possible controversial change in this commit is changing > MPIR_Breakpoint() to return (void) instead of (void*). Oddly, I see that > MPICH2 has 2 different prototypes for MPIR_Breakpoint -- one returns (void*), > another returns (int). Assuming that MPICH2 works fine with the debuggers, > this suggests that the return is ignored by the tools -- as it should be. > > I didn't check the volatile removals; I'm assuming that George got them > right. :-) > > I'll bet that this change does not cause any problems, but it might be worth > checking with the big 3+1: > > - DDT > - Totalview > - padb > - stat > > > On Nov 7, 2011, at 8:24 PM, bosi...@osl.iu.edu wrote: > >> Author: bosilca >> Date: 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011) >> New Revision: 25456 >> URL: https://svn.open-mpi.org/trac/ompi/changeset/25456 >> >> Log: >> Put the interface of our MPIR support in sync with the document accepted by >> the MPI >> Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf). >> >> Text files modified: >> trunk/ompi/debuggers/debuggers.h |28 >> ++-- >> trunk/orte/mca/debugger/base/base.h |10 +- >> >> trunk/orte/mca/debugger/base/debugger_base_fns.c | 6 +++--- >> >> trunk/orte/mca/debugger/base/debugger_base_open.c | 6 +++--- >> >> 4 files changed, 25 insertions(+), 25 deletions(-) >> >> Modified: trunk/ompi/debuggers/debuggers.h >> == >> --- trunk/ompi/debuggers/debuggers.h (original) >> +++ trunk/ompi/debuggers/debuggers.h 2011-11-07 20:24:16 EST (Mon, 07 Nov >> 2011) >> @@ -31,20 +31,20 @@ >> >> BEGIN_C_DECLS >> >> -/** >> - * Wait for a debugger if asked. 
>> - */ >> -extern void ompi_wait_for_debugger(void); >> - >> -/** >> - * Notify a debugger that we're about to abort >> - */ >> -extern void ompi_debugger_notify_abort(char *string); >> - >> -/** >> - * Breakpoint function for parallel debuggers. >> - */ >> -ORTE_DECLSPEC extern void *MPIR_Breakpoint(void); >> +/** >> + * Wait for a debugger if asked. >> + */ >> +extern void ompi_wait_for_debugger(void); >> + >> +/** >> + * Notify a debugger that we're about to abort >> + */ >> +extern void ompi_debugger_notify_abort(char *string); >> + >> +/** >> + * Breakpoint function for parallel debuggers. >> + */ >> +ORTE_DECLSPEC extern void MPIR_Breakpoint(void); >> >> END_C_DECLS >> >> >> Modified: trunk/orte/mca/debugger/base/base.h >> == >> --- trunk/orte/mca/debugger/base/base.h (original) >> +++ trunk/orte/mca/debugger/base/base.h 2011-11-07 20:24:16 EST (Mon, >> 07 Nov 2011) >> @@ -61,18 +61,18 @@ >> ORTE_DECLSPEC extern int MPIR_proctable_size; >> ORTE_DECLSPEC extern volatile int MPIR_being_debugged; >> ORTE_DECLSPEC extern volatile int MPIR_debug_state; >> -ORTE_DECLSPEC extern volatile int MPIR_i_am_starter; >> +ORTE_DECLSPEC extern int MPIR_i_am_starter; >> ORTE_DECLSPEC extern int MPIR_partial_attach_ok; >> -ORTE_DECLSPEC extern volatile char >> MPIR_executable_path[MPIR_MAX_PATH_LENGTH]; >> -ORTE_DECLSPEC extern volatile char >> MPIR_server_arguments[MPIR_MAX_ARG_LENGTH]; >> +ORTE_DECLSPEC extern char MPIR_executable_path[MPIR_MAX_PATH_LENGTH]; >> +ORTE_DECLSPEC extern char MPIR_server_arguments[MPIR_MAX_ARG_LENGTH]; >> ORTE_DECLSPEC extern volatile int MPIR_forward_output; >> ORTE_DECLSPEC extern volatile int MPIR_forward_comm; >> ORTE_DECLSPEC extern char MPIR_attach_fifo[MPIR_MAX_PATH_LENGTH]; >> ORTE_DECLSPEC extern int MPIR_force_to_main; >> >> -typedef void* (*orte_debugger_breakpoint_fn_t)(void); >> +typedef void (*orte_debugger_breakpoint_fn_t)(void); >> >> -ORTE_DECLSPEC void* MPIR_Breakpoint(void); >> +ORTE_DECLSPEC void MPIR_Breakpoint(void); >> >> 
/* --- end MPICH/TotalView std debugger interface definitions */ >> >> >> Modified: trunk/orte/mca/debugger/base/debugger_base_fns.c >> == >> --- trunk/orte/mca/debugger/base/debugger_base_fns.c (original) >> +++ trunk/orte/mca/debugger/base/debugger_base_fns.c 2011-11-07 20:24:16 EST >> (Mon, 07 Nov 2011) >> @@ -168,7 +168,7 @@ >> */ >>ORTE_PROGRESSED_WAIT(false, jdata->num_reported, jdata->num_procs); >> >> -(void) MPIR_Breakpoint(); >> +
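The "nothing more than a breakpoint" description in the reply above can be sketched as follows. This is a minimal illustration, not the Open MPI source: the variable and function names MPIR_debug_state, MPIR_being_debugged and MPIR_Breakpoint come from the quoted diff, while demo_signal_debugger is a hypothetical helper added for the example. The sketch assumes the usual pattern in which the starter process updates a state variable and then calls the empty function on which the debugger has planted a breakpoint.

```c
/* State the debugger reads and writes while the program runs; these stay
 * volatile because they can change behind the compiler's back. */
volatile int MPIR_being_debugged = 0;
volatile int MPIR_debug_state = 0;

/* Empty on purpose: the debugger sets a breakpoint on this symbol, so the
 * call itself is the signal. There is no return value to inspect. */
void MPIR_Breakpoint(void)
{
}

/* Hypothetical helper: publish the new state first, then hit the
 * breakpoint so the debugger observes an up-to-date value. */
static void demo_signal_debugger(int new_state)
{
    MPIR_debug_state = new_state;
    MPIR_Breakpoint();
}
```

A production build may need extra care to stop the compiler from inlining or discarding the empty function; that is outside the scope of this sketch.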
[OMPI devel] Open MPI BOF
Folks, Wednesday November 15th at 12:15 PST, we will have an Open MPI BOF. We will have two guest speakers: Rolf vandeVaart from NVIDIA and Shinji Sumimoto from the K-computer. If you are at SC, you are all invited to participate in this annual event. Mingle for a moment with our user community, and possibly answer particular questions raised during the discussion. At the same time, if you have any early work, exciting features, or long-awaited fixes for Open MPI that you want to make the user community aware of, this will be a perfect opportunity. Send me one slide (two at most) by Monday COB, and I will include it in the presentation. Looking forward to meeting you there, george.
Re: [OMPI devel] make check fails for Intel 2011.6.233 (OpenMPI 1.4.3)
The good news is that the issue reported in R25290 is fixed in the latest Intel compilers release (2011.7.256). The bad news is that both the 2011.6.233 and 2011.7.256 releases identify themselves as V12.1.0 from the command line. (I reported this bug to Intel already.) They can only be reliably distinguished using the predefined __INTEL_COMPILER_BUILD_DATE macro. I verified that the build dates for all three compilers we have -- Linux, Mac OS X, and Windows -- are the same.

I developed a more targeted patch (attached) for OpenMPI 1.4.3 opal/mca/memory/ptmalloc2/malloc.c which disables vectorization for _int_malloc() only if an Intel compiler with the 2011.6.233 release build date is found (__INTEL_COMPILER_BUILD_DATE == 20110811). This patch could presumably make its way into all the copies of opal/mca/memory/ptmalloc2/malloc.c in the various versions of OpenMPI that are still being maintained.

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On 17 Oct 2011, at 8:18 PM, George Bosilca wrote:

Larry,

Sorry for not updating this thread. The issue was identified and fixed by Rainer in r25290 (https://svn.open-mpi.org/trac/ompi/changeset/25290). Please read the comments and the linked thread on the Intel forum for more info about.

I couldn't find a trace of this being fixed in the 1.4 series, so I would wait upgrading until this issue gets resolved.

Thanks,
george.

On Oct 17, 2011, at 23:00 , Larry Baker wrote:

George,

I have not had time to look over the 1.4.3 make check failure for Intel 2011.6.233 compilers. Have you?

I had planned to get 1.4.3 compiled on all six of our compilers using the latest compiler releases. I was putting off upgrading to 1.4.4 or 1.5.x until after that to minimize the number of things that could go wrong. Do you recommend otherwise?
Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On 7 Oct 2011, at 6:46 PM, George Bosilca wrote:

The may_alias attribute was part of a forward-looking attribute checking, at a time where few compiler supported them. This explains why they are not widely used in the library itself. Moreover, as they do not affect the compilation itself (as your test highlights this is not the issue with the icc 2011.6.233 compiler), there is no urge to remove the may_alias support.

I just got that particular version of the compiler installed on one of our machines. I'll give it a try over the weekend.

george.

On Oct 7, 2011, at 20:21 , Larry Baker wrote:

The test for the __may_alias_ attribute uses the following short code snippet:

    int * p_value __attribute__ ((__may_alias__));
    int
    main ()
    {
      ;
      return 0;
    }

Indeed, for Intel 2011 compilers prior to 2011.6.233, this results in a warning:

    [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.5.220
    [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c
    may_alias_test.c(123): warning #1292: attribute "__may_alias__" ignored
      int * p_value __attribute__ ((__may_alias__));
      ^
    [root@hydra openmpi-1.4.3]# module unload compilers/intel/2011.5.220
    [root@hydra openmpi-1.4.3]# module load compilers/intel/2011.6.233
    [root@hydra openmpi-1.4.3]# icc -c may_alias_test.c

I modified ./configure to force

    ompi_cv___attribute__may_alias=0

Then I compiled and tested the library. Unfortunately, the results were exactly the same:

    make check-TESTS
    make[3]: Entering directory `/state/partition1/root/src/openmpi-1.4.3/test/datatype'
    /bin/sh: line 4: 26326 Segmentation fault ${dir}$tst
    FAIL: checksum
    /bin/sh: line 4: 26359 Segmentation fault ${dir}$tst
    FAIL: position

    2 of 2 tests failed
    Please report to http://www.open-mpi.org/community/help/

I could not find any use of the may_alias attribute, other than in a #define in opal/include/opal_config_bottom.h. Is OMPI_HAVE_ATTRIBUTE_MAY_ALIAS just cruft that can be removed?
Larry BakerUS Geological Survey650-329-5608ba...@usgs.gov On 7 Oct 2011, at 11:08 AM, Larry Baker wrote:I ran into a problem this past week trying to upgrade our OpenMPI 1.4.3 for the latest Intel 2011 compiler, 2011.6.233.make check fails with Segmentation Fault errors:[root@hydra openmpi-1.4.3]# tail -20 ../openmpi-1.4.3-check-intel.6.233.log/bin/sh ../../libtool --tag=CC --mode=link icc -DNDEBUG -g -O3 -finline-functions -fno-strict-aliasing -restrict -pthread -fvisibility=hidden -shared-intel -export-dynamic -shared-intel -o ddt_pack ddt_pack.o ../../ompi/libmpi.la -lnsl -lutil libtool: link: icc -DNDEBUG -g -O3 -finline-functions -fno-strict-aliasing -restrict -pthread -fvisibility=hidden -shared-intel -shared-intel -o .libs/ddt_pack ddt_pack.o -Wl,--export-dynamic ../../ompi/.libs/libmpi.so /usr/local/src/openmpi-1.4.3/orte/.libs/libopen-rte.so /usr/local/src/openmpi-1.4.3/opal/.libs/libopen-pal.so -ldl -lnsl -lutil -pthread -Wl,-rpath -Wl,/usr/local/libmake[3]: Leaving directory `/state/partition1/root/src/openmpi-1.4.3/test/datatype'make check-TESTSmake[3]:
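The build-date test described at the top of this message can be sketched as a preprocessor guard. This is only an illustration under stated assumptions: the macro DEMO_NEEDS_INT_MALLOC_WORKAROUND and the function demo_needs_workaround are hypothetical names, and the attached patch itself is not reproduced in the thread, so how it actually disables vectorization for _int_malloc() is not shown. Only the condition (__INTEL_COMPILER_BUILD_DATE == 20110811) comes from the message.

```c
/* Evaluates to 1 only when compiled by an Intel compiler whose build date
 * matches the 2011.6.233 release named in the message; 0 otherwise.
 * The macro name is hypothetical. */
#if defined(__INTEL_COMPILER) && defined(__INTEL_COMPILER_BUILD_DATE) && \
    (__INTEL_COMPILER_BUILD_DATE == 20110811)
#define DEMO_NEEDS_INT_MALLOC_WORKAROUND 1
#else
#define DEMO_NEEDS_INT_MALLOC_WORKAROUND 0
#endif

/* Run-time view of the compile-time decision, handy for logging which
 * code path a given build took. */
static int demo_needs_workaround(void)
{
    return DEMO_NEEDS_INT_MALLOC_WORKAROUND;
}
```

Keying on the build date rather than the version number matters here because, per the message, both the 2011.6.233 and 2011.7.256 releases report themselves as V12.1.0.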
Re: [OMPI devel] debugger confusion
On Nov 8, 2011, at 8:37 AM, Jeff Squyres wrote: > On Nov 8, 2011, at 10:25 AM, George Bosilca wrote: > >> However, based on what we have in the trunk today, Open MPI doesn't follow >> that document. As Ralph pinpointed it, the current version work with several >> tools (tv, stat, padb) as is, so that means the tools do not really follow >> that document either. > > This is not quite accurate. > > What the tools did over the past decade was make it so that they work with > the 5-6 MPIR variants that are out there. So yes, they work with OMPI, but > they work with the others who aren't quite "right," either. Because before > this, there was no central definition of "right." Agreed, though with a slight variation. Not only were the MPIs variant, but so are the tools. Some tools support various MPIR extensions and combinations of features, and others don't. That was the motivation behind some of us "pushing" the tool vendors to create a "standard" MPIR definition - it was to get all those extensions defined. The base stuff was always pretty common. And yes - I was one of those "twisting" their arms because I got tired of dealing with all the bloody tool interface variations, providing special code to support someone's pet extension, etc. > > The intent of the document was to make that central definition of "right" and > gradually have everyone move to it. AFAIK, all the tools have been updated > to work with the "right" definition of MPIR. While I think people may generally support some of the basic MPIR definitions, I haven't seen movement to supporting the full range - but maybe I've missed it. I haven't been following it as much over the last year or so. Even if they have, though, there is no way for us to control what release someone is using. So we still have to support both the old and the new variations for some time. 
> > Keep in mind that this is pretty much the same rationale as to why MPI still > supports functions like MPI_ATTR_SET: even though it's deprecated, there's > apps out there that still use it and will take a long time to adapt. Hence, > the tools will keep supporting the "old" / "not-quite-right" definitions of > MPIR for a long time. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] debugger confusion
On Nov 8, 2011, at 10:25 AM, George Bosilca wrote: > However, based on what we have in the trunk today, Open MPI doesn't follow > that document. As Ralph pinpointed it, the current version work with several > tools (tv, stat, padb) as is, so that means the tools do not really follow > that document either. This is not quite accurate. What the tools did over the past decade was make it so that they work with the 5-6 MPIR variants that are out there. So yes, they work with OMPI, but they work with the others who aren't quite "right," either. Because before this, there was no central definition of "right." The intent of the document was to make that central definition of "right" and gradually have everyone move to it. AFAIK, all the tools have been updated to work with the "right" definition of MPIR. Keep in mind that this is pretty much the same rationale as to why MPI still supports functions like MPI_ATTR_SET: even though it's deprecated, there's apps out there that still use it and will take a long time to adapt. Hence, the tools will keep supporting the "old" / "not-quite-right" definitions of MPIR for a long time. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Remote key sizes
On Tue, 8 Nov 2011 06:36:03 -0800, Rolf vandeVaart wrote:
>> george.
>>
>> PS: Regarding the hand-copy instead of the memcpy, we tried to avoid using
>> memcpy in performance critical codes, especially when we know the size of
>> the data and the alignment. This relieves the compiler of adding ugly intrinsics,
>> allowing it to nicely pipeline to load/stores. Anyway, with both approaches
>> you will copy more data than needed for all BTLs except uGNI.
>
> I was looking at a case in a BTL I was working on where I actually need 64
> bytes (yes, bytes) as the remote key size as opposed to the current 16
> bytes (128 bits).
> Not sure how I can handle that yet. (I assume configure is my friend, but
> even in that case, all headers will need to carry around the extra data.)
>

I have been thinking about this a little bit. What I think should be done (and I am sure George will disagree) is to allow BTLs to define how long a segment is. The PML would then just memcpy the segments into the send buffer (instead of copying each member).

For example mca_btl_base_segment_t would become:

    struct mca_btl_base_segment_t {
        size_t seg_len;
    };

since the pml needs the segment size (it does not need anything else).

and then each btl would define its own segment like:

    struct mca_btl_ugni_segment_t {
        struct mca_btl_base_segment_t base;
        gni_mem_handle_t seg_key;
    };

and we would add:

    size_t btl_segment_len;

to the mca_btl_base_module_t or the base frag so the pml knows how much it needs to copy.

This design would address George's criticism of the length of the seg_key and also allow BTLs to do what they need to. It would require a memcpy but I disagree this would slow the critical path. Even if it does it would be relatively minor (I think) and the flexibility is worth more in the long run.

-Nathan
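The proposal above can be sketched in a few lines of C. This is a rough illustration of the idea, not code from the Open MPI tree: struct mca_btl_base_segment_t and the field name btl_segment_len follow the message, while demo_btl_segment_t (with a 64-byte key, as in Rolf's case), demo_btl_module_t and demo_pml_pack_segments are hypothetical names invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Base segment as proposed in the message: the PML only needs the length. */
struct mca_btl_base_segment_t {
    size_t seg_len;
};

/* Hypothetical BTL-specific segment carrying a 64-byte remote key. */
struct demo_btl_segment_t {
    struct mca_btl_base_segment_t base;
    uint8_t seg_key[64];
};

/* Hypothetical stand-in for the module structure; only the proposed
 * field is modeled. */
struct demo_btl_module_t {
    size_t btl_segment_len;   /* bytes the PML must copy per segment */
};

/* The PML treats segments as opaque blobs and copies them in one go.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t demo_pml_pack_segments(const struct demo_btl_module_t *btl,
                                     const void *segments, size_t count,
                                     void *buf, size_t buf_len)
{
    /* Division-based check avoids overflow in the size multiplication. */
    if (count != 0 && btl->btl_segment_len > buf_len / count) {
        return 0;
    }
    size_t total = btl->btl_segment_len * count;
    memcpy(buf, segments, total);
    return total;
}
```

With this shape the PML never looks inside the key, so a BTL can grow its key from 16 to 64 bytes by changing only its own segment type and the length it advertises.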
Re: [OMPI devel] debugger confusion
On Nov 8, 2011, at 8:25 AM, George Bosilca wrote: > > On Nov 8, 2011, at 07:52 , Jeff Squyres wrote: > >> To be clear: that document simply standardizes what MPI implementations are >> supposed to provide in their MPIR implementation (prior to this, MPI >> implementations tended to have subtle differences between their MPIR >> implementations, which were a nightmare for the debugger/tool vendors). >> This document does *not* fix the scalability and other well-known issues >> with MPIR -- it just consolidates and standardizes the slightly-different >> versions of MPIR that were floating around out there. > > However, based on what we have in the trunk today, Open MPI doesn't follow > that document. As Ralph pinpointed it, the current version work with several > tools (tv, stat, padb) as is, so that means the tools do not really follow > that document either. What a mess … > > All the time we spent in the MPI Forum talking about the MPIR interface, and > look at the result ! Patience, patience - I look at the document as describing where people want to go, not a snapshot of where they already are. What will be interesting to see is how long it takes them to get there, how many of them will bother to do so, etc. And, of course, how we maintain integration with all of them as the migration progresses! > > george. > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] debugger confusion
On Nov 8, 2011, at 07:52 , Jeff Squyres wrote:

> To be clear: that document simply standardizes what MPI implementations are
> supposed to provide in their MPIR implementation (prior to this, MPI
> implementations tended to have subtle differences between their MPIR
> implementations, which were a nightmare for the debugger/tool vendors). This
> document does *not* fix the scalability and other well-known issues with MPIR
> -- it just consolidates and standardizes the slightly-different versions of
> MPIR that were floating around out there.

However, based on what we have in the trunk today, Open MPI doesn't follow that document. As Ralph pointed out, the current version works with several tools (tv, stat, padb) as is, so that means the tools do not really follow that document either. What a mess...

All the time we spent in the MPI Forum talking about the MPIR interface, and look at the result!

george.
[OMPI devel] Remote key sizes
> george. > >PS: Regarding the hand-copy instead of the memcpy, we tried to avoid using >memcpy in performance critical codes, especially when we know the size of >the data and the alignment. This relieves the compiler of adding ugly >intrinsics, >allowing it to nicely pipeline to load/stores. Anyway, with both approaches >you will copy more data than needed for all BTLs except uGNI. I was looking at a case in a BTL I was working on where I actually need 64 bytes (yes, bytes) as the remote key size as opposed to the current 16 bytes (128 bits). Not sure how I can handle that yet. (I assume configure is my friend, but even in that case, all headers will need to carry around the extra data.) Rolf > >On Nov 7, 2011, at 21:48 , Nathan T. Hjelm wrote: > >> >> >> On Mon, 7 Nov 2011 17:18:42 -0500, George Bosilca >>>> wrote: >>> A little bit of history: >>> >>> 1. r25305: added 2 atomic operations to OPAL. However, they only >>> exists >> on >>> amd64 and are only used in the vader BTL, which I assume only >>> supports amd64. >> >> Two things: >> - The atomic is a new feature that has no impact on existing code. It >> can also be implemented on Intel but we have not tested it (yet). >> - The atomic was pushed to support lock-free queues in the Vader BTL. >> Vader does not need the atomics and can use an atomic lock lock but I >> see higher latencies when using locks. >> >> Why would this change (that has no impact on any other code) need an >RFC? >> >>> 2. r25334: The seg_key union got a new member ptr. This member is >>> solely used in the vader BTL, as all other BTL use a compiler trick >>> to convert a pointer to a 64 bits. >> >> I am actually going to remove that member. I prefer the use of >> uintptr_t over casting to a uint64_t but it has no real benefit and >> possibly a pitfall due to its platform dependent size. >> >> But the member has, like the atomic, no impact on any exiting code. It >> does not change the size of the seg_key and was only used by Vader. 
>> Why would this change have required an RFC? >> >>> 3. r25445: All members of the seg_key union got friends, because Cray >> dare >>> to set their keys at 128 bits long. However a quick find . -name >>> "*.[ch]" -exec grep -Hn seg_key {} \; | grep "\[1\]" >>> indicates that no BTL is using 128 bits keys. Code has been added to >>> all PMLs, but I guess they just copy empty data. >> >> For now they copy empty data but in the near future (as I have said) >> we will need to bits for the ugni btl (Cray XE Gemini). I pushed this >> code to prepare for pushing ugni. >> >> Also, you might be a good person to ask: Why do we copy each member of >> a segment individually in the PMLs? Wouldn't it be faster to do a >> memcpy? If we were using a memcpy I would not have had to make any >change to the pmls. >> >>> What I see is a pattern of commits that can have been dealt with >>> differently. None had an RFC, and most of them are not even used. >> >> I think you are reaching a little here. I pushed several changes over >> a period of a month. The first two are not related to the third which >> is the only one that could have any impact to existing code and might >> require an RFC. >> >> In retrospect I should have done a RFC for the 3rd change with a short >> timeout. At the time (operating on little sleep) it seemed like the >> commits would have minimal impact. Please let me know if the commits >> have any negative impact. >> >> -Nathan >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > >___ >devel mailing list >de...@open-mpi.org >http://www.open-mpi.org/mailman/listinfo.cgi/devel --- This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. ---
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25445
Sure, I can do that. My only concern is with sending between hosts of different endianness.

For example, if seg_key is 128 bits wide and the key32 is 64 bits then we might run into this:

Host 1: (big endian)
Set seg_key.key32[0] = 0x

would result in seg_key: 0x 0x 0x 0x

Host 2: (little endian)
Set seg_key.key32[0] = 0x1

would result in seg_key: 0x 0x 0x 0x

If either host were to send the other one its seg_key and try to use the key32 they would get garbage. I haven't tested this case yet but I can test on a PPE of RR later today.

-Nathan

On Tue, 8 Nov 2011 08:26:04 -0500, Jeff Squyres wrote:
> On Nov 7, 2011, at 9:48 PM, Nathan T. Hjelm wrote:
>
>> In retrospect I should have done a RFC for the 3rd change with a short
>> timeout. At the time (operating on little sleep) it seemed like the commits
>> would have minimal impact. Please let me know if the commits have any
>> negative impact.
>
> FWIW, I think I'd like to see a rollback of the increase of array sizes in
> the seg_key union. They weren't necessary and might be slightly
> misleading.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] debugger changes
I think the only possible controversial change in this commit is changing MPIR_Breakpoint() to return (void) instead of (void*). Oddly, I see that MPICH2 has 2 different prototypes for MPIR_Breakpoint -- one returns (void*), another returns (int). Assuming that MPICH2 works fine with the debuggers, this suggests that the return is ignored by the tools -- as it should be. I didn't check the volatile removals; I'm assuming that George got them right. :-) I'll bet that this change does not cause any problems, but it might be worth checking with the big 3+1: - DDT - Totalview - padb - stat On Nov 7, 2011, at 8:24 PM, bosi...@osl.iu.edu wrote: > Author: bosilca > Date: 2011-11-07 20:24:16 EST (Mon, 07 Nov 2011) > New Revision: 25456 > URL: https://svn.open-mpi.org/trac/ompi/changeset/25456 > > Log: > Put the interface of our MPIR support in sync with the document accepted by > the MPI > Forum (http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf). > > Text files modified: > trunk/ompi/debuggers/debuggers.h |28 > ++-- > trunk/orte/mca/debugger/base/base.h |10 +- > > trunk/orte/mca/debugger/base/debugger_base_fns.c | 6 +++--- > > trunk/orte/mca/debugger/base/debugger_base_open.c | 6 +++--- > > 4 files changed, 25 insertions(+), 25 deletions(-) > > Modified: trunk/ompi/debuggers/debuggers.h > == > --- trunk/ompi/debuggers/debuggers.h (original) > +++ trunk/ompi/debuggers/debuggers.h 2011-11-07 20:24:16 EST (Mon, 07 Nov > 2011) > @@ -31,20 +31,20 @@ > > BEGIN_C_DECLS > > -/** > - * Wait for a debugger if asked. > - */ > -extern void ompi_wait_for_debugger(void); > - > -/** > - * Notify a debugger that we're about to abort > - */ > -extern void ompi_debugger_notify_abort(char *string); > - > -/** > - * Breakpoint function for parallel debuggers. > - */ > -ORTE_DECLSPEC extern void *MPIR_Breakpoint(void); > +/** > + * Wait for a debugger if asked. 
> + */ > +extern void ompi_wait_for_debugger(void); > + > +/** > + * Notify a debugger that we're about to abort > + */ > +extern void ompi_debugger_notify_abort(char *string); > + > +/** > + * Breakpoint function for parallel debuggers. > + */ > +ORTE_DECLSPEC extern void MPIR_Breakpoint(void); > > END_C_DECLS > > > Modified: trunk/orte/mca/debugger/base/base.h > == > --- trunk/orte/mca/debugger/base/base.h (original) > +++ trunk/orte/mca/debugger/base/base.h 2011-11-07 20:24:16 EST (Mon, > 07 Nov 2011) > @@ -61,18 +61,18 @@ > ORTE_DECLSPEC extern int MPIR_proctable_size; > ORTE_DECLSPEC extern volatile int MPIR_being_debugged; > ORTE_DECLSPEC extern volatile int MPIR_debug_state; > -ORTE_DECLSPEC extern volatile int MPIR_i_am_starter; > +ORTE_DECLSPEC extern int MPIR_i_am_starter; > ORTE_DECLSPEC extern int MPIR_partial_attach_ok; > -ORTE_DECLSPEC extern volatile char > MPIR_executable_path[MPIR_MAX_PATH_LENGTH]; > -ORTE_DECLSPEC extern volatile char > MPIR_server_arguments[MPIR_MAX_ARG_LENGTH]; > +ORTE_DECLSPEC extern char MPIR_executable_path[MPIR_MAX_PATH_LENGTH]; > +ORTE_DECLSPEC extern char MPIR_server_arguments[MPIR_MAX_ARG_LENGTH]; > ORTE_DECLSPEC extern volatile int MPIR_forward_output; > ORTE_DECLSPEC extern volatile int MPIR_forward_comm; > ORTE_DECLSPEC extern char MPIR_attach_fifo[MPIR_MAX_PATH_LENGTH]; > ORTE_DECLSPEC extern int MPIR_force_to_main; > > -typedef void* (*orte_debugger_breakpoint_fn_t)(void); > +typedef void (*orte_debugger_breakpoint_fn_t)(void); > > -ORTE_DECLSPEC void* MPIR_Breakpoint(void); > +ORTE_DECLSPEC void MPIR_Breakpoint(void); > > /* --- end MPICH/TotalView std debugger interface definitions */ > > > Modified: trunk/orte/mca/debugger/base/debugger_base_fns.c > == > --- trunk/orte/mca/debugger/base/debugger_base_fns.c (original) > +++ trunk/orte/mca/debugger/base/debugger_base_fns.c 2011-11-07 20:24:16 EST > (Mon, 07 Nov 2011) > @@ -168,7 +168,7 @@ > */ > ORTE_PROGRESSED_WAIT(false, jdata->num_reported, jdata->num_procs); > > 
-(void) MPIR_Breakpoint(); > +MPIR_Breakpoint(); > > /* send a message to rank=0 to release it */ > OBJ_CONSTRUCT(, opal_buffer_t); /* don't need anything in this */ > @@ -186,7 +186,7 @@ > /* > * Breakpoint function for parallel debuggers > */ > -void *MPIR_Breakpoint(void) > +void MPIR_Breakpoint(void) > { > -return NULL; > +return; > } > > Modified: trunk/orte/mca/debugger/base/debugger_base_open.c > == > ---
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25445
On Nov 7, 2011, at 9:48 PM, Nathan T. Hjelm wrote: > In retrospect I should have done a RFC for the 3rd change with a short > timeout. At the time (operating on little sleep) it seemed like the commits > would have minimal impact. Please let me know if the commits have any > negative impact. FWIW, I think I'd like to see a rollback of the increase of array sizes in the seg_key union. They weren't necessary and might be slightly misleading. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] debugger confusion
On Nov 7, 2011, at 8:34 PM, Ralph Castain wrote: > Best guess: from what I've seen, most debuggers don't seem to conform to what > the MPI Forum has "accepted". It doesn't appear that the vendors and debugger > developers pay too much attention to that document, possibly because it (a) > came after the debuggers were developed, and (b) still doesn't seem to be > widely adopted. Keep in mind that the debugger/tool authors essentially wrote the document, with some guidance from the Forum. The Forum saw the wisdom in making it an "official" MPI Forum document so that it would carry some weight, and voted to do so. That document is not actually part of any MPI standard document for multiple reasons; here's two: 1. MPIR has a bunch of known problems which no one is currently interested in fixing (e.g., scalability) 2. No one wanted to *mandate* the MPIR interface in an MPI implementation It is therefore a standalone document that, since it became an "official" Forum document, is available on mpi-forum.org: http://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf To be clear: that document simply standardizes what MPI implementations are supposed to provide in their MPIR implementation (prior to this, MPI implementations tended to have subtle differences between their MPIR implementations, which were a nightmare for the debugger/tool vendors). This document does *not* fix the scalability and other well-known issues with MPIR -- it just consolidates and standardizes the slightly-different versions of MPIR that were floating around out there. > I'd suggest being a little careful about making changes without consulting > people who use TV and "stat", at least - those are the ones most recently > tested. Fair enough. Moving towards what was specified in that document would probably be a good thing, though, since that document *is* the currently accepted version of how MPIR is supposed to work and was essentially written *by* the tool vendors. 
Of course, appropriate testing with various debuggers and tools out there should be a given -- current versions of DDT, Totalview, and padb are probably the 3 most obvious ones with which to test; others have mentioned some "stat," too. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] debugger confusion
On Nov 8, 2011, at 4:48 AM, Ashley Pittman wrote: > I agree that it's not clear this, I don't think this spec is well understood > by anyone, indeed it wasn't originally written with the intention of becoming > a specification at all. I've looked at it a couple of times but never used > this aspect of it, padb (and I believe stat is the same) don't ever launch > jobs under control of the debugger, simply attach to an already existing job > which means I've been able to ignore this part of the spec in padb entirely. > This was the point I was trying to communicate earlier, without apparent success. I don't think this document can be treated like a spec at this point, nor should we assume that debugger "vendors" already support it. It isn't clear to me that any real consensus understanding of the document even exists at this time. Hence, I really suggest caution about making changes to our interface code without people with access to the various debuggers having a chance to test the idea. It took some degree of pain to get this all working, especially to support those debuggers that dynamically attach, and I for one would rather not go thru it again just because someone decided to interpret the document a particular way. Nathan/Sam: can you please test stat against the trunk and see if it still works? Ashley: ditto with padb, when you have time, would be most appreciated. Ralph
Re: [OMPI devel] Segfault in odls_fork_local_procs() for some values of npersocket
Looks fine to me - CMR filed. Thanks!

On Nov 8, 2011, at 1:01 AM, nadia.derbey wrote:

> Hi,
>
> In v1.5, when mpirun is called with both the "-bind-to-core" and
> "-npersocket" options, and the npersocket value leads to fewer procs than
> sockets allocated on one node, we get a segfault.
>
> Testing environment:
> openmpi v1.5
> 2 nodes with four 8-core sockets each
> mpirun -n 10 -bind-to-core -npersocket 2
>
> I was expecting to get:
> . ranks 0-1 : node 0 - socket 0
> . ranks 2-3 : node 0 - socket 1
> . ranks 4-5 : node 0 - socket 2
> . ranks 6-7 : node 0 - socket 3
> . ranks 8-9 : node 1 - socket 0
>
> Instead of that, everything worked fine on node 0, and I got a segfault
> on node 1, with a stack that looks like:
>
> [derbeyn@berlin18 ~]$ mpirun --host berlin18,berlin26 -n 10 -bind-to-core -npersocket 2 sleep 900
> [berlin26:21531] *** Process received signal ***
> [berlin26:21531] Signal: Floating point exception (8)
> [berlin26:21531] Signal code: Integer divide-by-zero (1)
> [berlin26:21531] Failing at address: 0x7fed13731d63
> [berlin26:21531] [ 0] /lib64/libpthread.so.0(+0xf490) [0x7fed15327490]
> [berlin26:21531] [ 1] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/openmpi/mca_odls_default.so(+0x2d63) [0x7fed13731d63]
> [berlin26:21531] [ 2] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_odls_base_default_launch_local+0xaf3) [0x7fed15e1fe73]
> [berlin26:21531] [ 3] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/openmpi/mca_odls_default.so(+0x1d10) [0x7fed13730d10]
> [berlin26:21531] [ 4] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(+0x3804d) [0x7fed15e1004d]
> [berlin26:21531] [ 5] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_daemon_cmd_processor+0x4aa) [0x7fed15e1209a]
> [berlin26:21531] [ 6] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(+0x74ee8) [0x7fed15e4cee8]
> [berlin26:21531] [ 7] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_daemon+0x8d8) [0x7fed15e0f268]
> [berlin26:21531] [ 8] /home_nfs/derbeyn/DISTS/openmpi-v1.5/bin/orted() [0x4008c6]
> [berlin26:21531] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7fed14fa7c9d]
> [berlin26:21531] [10] /home_nfs/derbeyn/DISTS/openmpi-v1.5/bin/orted() [0x400799]
> [berlin26:21531] *** End of error message ***
>
> The reason for this issue is that the npersocket value is taken into
> account during the very first phase of mpirun (rmaps/load_balance) to
> claim the slots on each node:
> npersocket() (in rmaps/load_balance/rmaps_lb.c) claims
> . 8 slots on node 0 (4 sockets * 2 persocket)
> . 2 slots on node 1 (10 total ranks - 8 already claimed)
>
> But when we come to odls_default_fork_local_proc() (in
> odls/default/odls_default_module.c) npersocket is actually recomputed.
> Everything works fine on node 0. But on node 1, we have:
> . jobdat->policy has both ORTE_BIND_TO_CORE and ORTE_MAPPING_NPERXXX
> . npersocket is recomputed the following way:
>   npersocket = jobdat->num_local_procs/orte_odls_globals.num_sockets
>              = 2 / 4 = 0
> . later on, when the starting point is computed:
>   logical_cpu = (lrank % npersocket) * jobdat->cpus_per_rank;
>   we get the divide-by-zero exception.
>
> The problem comes, in my mind, from the fact that we are recomputing
> npersocket on the local nodes instead of storing it in the jobdat
> structure (as is done today for the policy, the cpus_per_rank, the
> stride,...).
> Recomputing this value leads either to the segfault I got, or even to
> wrong mappings: if we had had 4 slots claimed on node 1, the result
> would have been 1 rank per socket (since we have 4-socket nodes)
> instead of 2 ranks on the first 2 sockets.
>
> The attached patch is a fix proposal implementing my suggestion of
> storing the npersocket in the jobdat.
>
> This patch applies to v1.5. Waiting for your comments...
>
> Regards,
> Nadia
>
> --
> Nadia Derbey
> <001_dont_recompute_npersocket_on_local_nodes.patch>
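The arithmetic behind the crash in the report quoted above can be sketched in isolation. This is only an illustration, not the Open MPI code: the helper names `recomputed_npersocket` and `start_cpu` and their parameters are hypothetical, and only the two expressions are taken from the report. The guard in `start_cpu` is likewise an assumption made for the sketch; the fix proposed in the report is to store npersocket in the jobdat rather than recompute it.

```c
#include <assert.h>

/* Hypothetical helper mirroring the recomputation quoted in the report:
 * npersocket = num_local_procs / num_sockets (integer division). */
static int recomputed_npersocket(int num_local_procs, int num_sockets)
{
    assert(num_sockets > 0);
    return num_local_procs / num_sockets;  /* 2 / 4 truncates to 0 */
}

/* Hypothetical helper mirroring
 *   logical_cpu = (lrank % npersocket) * cpus_per_rank
 * but returning -1 instead of evaluating a modulo by zero. */
static int start_cpu(int lrank, int npersocket, int cpus_per_rank)
{
    if (npersocket <= 0) {
        return -1;  /* the unguarded expression would divide by zero here */
    }
    return (lrank % npersocket) * cpus_per_rank;
}
```

With the values from the report, node 0 gives 8 / 4 = 2 while node 1 gives 2 / 4 = 0, which is the zero divisor. Passing the value requested on the command line (2) to `start_cpu` instead of the recomputed one avoids it.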
Re: [OMPI devel] debugger confusion
On 8 Nov 2011, at 00:59, George Bosilca wrote:

> A started process is defined as being our mpirun. In Open MPI
> MPIR_partial_attach_ok is defined, so the tool will suppose that we provide a
> means to synchronize the processes not based on MPIR_debug_gate. Therefore
> only one behavior is acceptable based on the text above: no MPIR_debug_gate=1
> should be issued by the tool.

Open MPI itself (via ORTE) is not the only possible launch mechanism for Open MPI jobs. Slurm is the only other tool I can think of off the top of my head that can do it, but I wouldn't be surprised if there are others. At the time the document was written it was assumed that the MPI library and the resource manager/job launcher were so closely integrated that they could be treated as part of the same software.

> However, in the ompi_debuggers.c around line 226, we have an if that switches
> between the two acceptable behaviors (MPIR_debug_gate or own mechanism) based
> on the fact that we are a standalone (slurmd or generic) or not. As generic
> is the ess loaded in most of the cases, I can't figure out how this works if
> the MPIR specification document has to be trusted.

Unless the library can guarantee that the starter process has MPIR_partial_attach_ok, the only safe thing it can do is wait on MPIR_debug_gate. The only way the library can make any guarantees about mpirun is if it's launched from orted.

I agree that this isn't clear. I don't think this spec is well understood by anyone; indeed, it wasn't originally written with the intention of becoming a specification at all. I've looked at it a couple of times but never used this aspect of it: padb (and I believe stat is the same) never launches jobs under control of the debugger, it simply attaches to an already existing job, which means I've been able to ignore this part of the spec in padb entirely.

Ashley.
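The "wait on MPIR_debug_gate" fallback discussed above can be illustrated with a short sketch. `MPIR_debug_gate` is the variable name used in the thread; the helper `wait_on_debug_gate` and its bounded polling loop are assumptions made for illustration only and do not reflect how Open MPI actually implements the wait (a real implementation would block or sleep rather than give up after a fixed number of polls).

```c
#include <assert.h>

/* Gate variable named in the thread; a tool sets it to 1 to release
 * the processes. Declared volatile because it is written from outside. */
volatile int MPIR_debug_gate = 0;

/* Hypothetical helper: poll the gate at most max_polls times.
 * Returns 1 if the gate was released, 0 if polling gave up. */
static int wait_on_debug_gate(long max_polls)
{
    assert(max_polls >= 0);
    for (long i = 0; i < max_polls; i++) {
        if (MPIR_debug_gate != 0) {
            return 1;  /* the tool released the gate */
        }
    }
    return 0;  /* still closed after max_polls checks */
}
```

The bounded loop only keeps the sketch testable. The point it shows is that a process with no guarantee about its starter has nothing to rely on except the gate variable itself.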
[OMPI devel] Segfault in odls_fork_local_procs() for some values of npersocket
Hi,

In v1.5, when mpirun is called with both the "-bind-to-core" and "-npersocket" options, and the npersocket value leads to fewer procs than sockets allocated on one node, we get a segfault.

Testing environment:
openmpi v1.5
2 nodes with four 8-core sockets each
mpirun -n 10 -bind-to-core -npersocket 2

I was expecting to get:
. ranks 0-1 : node 0 - socket 0
. ranks 2-3 : node 0 - socket 1
. ranks 4-5 : node 0 - socket 2
. ranks 6-7 : node 0 - socket 3
. ranks 8-9 : node 1 - socket 0

Instead of that, everything worked fine on node 0, and I got a segfault on node 1, with a stack that looks like:

[derbeyn@berlin18 ~]$ mpirun --host berlin18,berlin26 -n 10 -bind-to-core -npersocket 2 sleep 900
[berlin26:21531] *** Process received signal ***
[berlin26:21531] Signal: Floating point exception (8)
[berlin26:21531] Signal code: Integer divide-by-zero (1)
[berlin26:21531] Failing at address: 0x7fed13731d63
[berlin26:21531] [ 0] /lib64/libpthread.so.0(+0xf490) [0x7fed15327490]
[berlin26:21531] [ 1] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/openmpi/mca_odls_default.so(+0x2d63) [0x7fed13731d63]
[berlin26:21531] [ 2] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_odls_base_default_launch_local+0xaf3) [0x7fed15e1fe73]
[berlin26:21531] [ 3] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/openmpi/mca_odls_default.so(+0x1d10) [0x7fed13730d10]
[berlin26:21531] [ 4] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(+0x3804d) [0x7fed15e1004d]
[berlin26:21531] [ 5] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_daemon_cmd_processor+0x4aa) [0x7fed15e1209a]
[berlin26:21531] [ 6] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(+0x74ee8) [0x7fed15e4cee8]
[berlin26:21531] [ 7] /home_nfs/derbeyn/DISTS/openmpi-v1.5/lib/libopen-rte.so.3(orte_daemon+0x8d8) [0x7fed15e0f268]
[berlin26:21531] [ 8] /home_nfs/derbeyn/DISTS/openmpi-v1.5/bin/orted() [0x4008c6]
[berlin26:21531] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd) [0x7fed14fa7c9d]
[berlin26:21531] [10] /home_nfs/derbeyn/DISTS/openmpi-v1.5/bin/orted() [0x400799]
[berlin26:21531] *** End of error message ***

The reason for this issue is that the npersocket value is taken into account during the very first phase of mpirun (rmaps/load_balance) to claim the slots on each node:
npersocket() (in rmaps/load_balance/rmaps_lb.c) claims
. 8 slots on node 0 (4 sockets * 2 persocket)
. 2 slots on node 1 (10 total ranks - 8 already claimed)

But when we come to odls_default_fork_local_proc() (in odls/default/odls_default_module.c) npersocket is actually recomputed. Everything works fine on node 0. But on node 1, we have:
. jobdat->policy has both ORTE_BIND_TO_CORE and ORTE_MAPPING_NPERXXX
. npersocket is recomputed the following way:
  npersocket = jobdat->num_local_procs/orte_odls_globals.num_sockets = 2 / 4 = 0
. later on, when the starting point is computed:
  logical_cpu = (lrank % npersocket) * jobdat->cpus_per_rank;
  we get the divide-by-zero exception.

The problem comes, in my mind, from the fact that we are recomputing npersocket on the local nodes instead of storing it in the jobdat structure (as is done today for the policy, the cpus_per_rank, the stride,...).
Recomputing this value leads either to the segfault I got, or even to wrong mappings: if we had had 4 slots claimed on node 1, the result would have been 1 rank per socket (since we have 4-socket nodes) instead of 2 ranks on the first 2 sockets.

The attached patch is a fix proposal implementing my suggestion of storing the npersocket in the jobdat.

This patch applies to v1.5. Waiting for your comments...
Regards,
Nadia

--
Nadia Derbey

npersocket should not be recomputed in odls_default_fork_local_procs: segfault might occur in some particular cases

diff -r ce3749a94a9e orte/mca/odls/base/odls_base_default_fns.c
--- a/orte/mca/odls/base/odls_base_default_fns.c Fri Nov 04 13:31:18 2011 +0100
+++ b/orte/mca/odls/base/odls_base_default_fns.c Fri Nov 04 13:55:00 2011 +0100
@@ -352,6 +352,12 @@ int orte_odls_base_default_get_add_procs
         return rc;
     }
 
+    /* pack the npersocket for this job */
+    if (ORTE_SUCCESS != (rc = opal_dss.pack(data, >npersocket, 1, OPAL_INT32))) {
+        ORTE_ERROR_LOG(rc);
+        return rc;
+    }
+
     /* pack the cpus_per_rank for this job */
     if (ORTE_SUCCESS != (rc = opal_dss.pack(data, >cpus_per_rank, 1, OPAL_INT16))) {
         ORTE_ERROR_LOG(rc);
@@ -809,6 +815,12 @@ int orte_odls_base_default_construct_chi
         ORTE_ERROR_LOG(rc);
         goto REPORT_ERROR;
     }
+    /* unpack the npersocket for the job */
+    cnt=1;
+    if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, >npersocket, , OPAL_INT32))) {
+        ORTE_ERROR_LOG(rc);
+        goto REPORT_ERROR;
+    }
     /* unpack the cpus/rank for the job */
     cnt=1;
     if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, >cpus_per_rank, , OPAL_INT16))) {
diff -r ce3749a94a9e