Re: [OMPI devel] SM initialization race condition
A little Google searching, and the best I can find is that memset is part of
the C89/C90 standard, while bzero isn't. Thus memset would/should be
supported even on non-POSIX systems. Also, the Open Group marks bzero as
LEGACY and says "This function may be withdrawn in a future version."
http://www.opengroup.org/onlinepubs/95399/functions/bzero.html

However, who actually thinks that bzero would ever be removed?
And yes, there is a hyphen in ;-)

Now back to productive work for me.

On Thu, Aug 21, 2008 at 9:39 AM, Tim Mattox wrote:
> Actually, bzero() is POSIX. Here is the history section of the bzero man page
> on Mac OS X 10.4:
>
>      A bzero() function appeared in 4.3BSD. Its prototype existed previously
>      in <string.h> before it was moved to <strings.h> for IEEE Std 1003.1-2001
>      (``POSIX.1'') compliance.
>
> Hmmm, but the Linux man page says it is deprecated, and says we should
> use memset. Wish it explained why... so I think we are safe to just use bzero.
>
> On Thu, Aug 21, 2008 at 9:32 AM, Jeff Squyres wrote:
>> IIRC, bzero is a gnu-ism. We should probably use memset instead.

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/
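For readers following the memset-versus-bzero question, here is a minimal sketch of the ISO C form of the zeroing call. The helper names `zero_region` and `region_is_zero` are made up for illustration and are not Open MPI functions; the only library call used is `memset()` from `<string.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>   /* memset() is ISO C90 */

/* Hypothetical helper (not an Open MPI function): zero `len` bytes starting
 * at `addr` using only the ISO C library. Equivalent in effect to
 * bzero(addr, len). */
static void zero_region(void *addr, size_t len)
{
    memset(addr, 0, len);
}

/* Returns 1 if every byte in [addr, addr + len) is zero, else 0. */
static int region_is_zero(const void *addr, size_t len)
{
    const unsigned char *p = addr;
    size_t i;

    for (i = 0; i < len; i++) {
        if (p[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

Written with memset, the line added by the patch in this thread would read `memset(map->data_addr, 0, size);`, which has the same effect as the bzero call.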
Re: [OMPI devel] SM initialization race condition
George Bosilca wrote:
> Terry,
>
> We use the feature defined by POSIX mmap where the area should be
> zero-filled when the file length is extended. What OS are you using when
> you see such problems?

So far I've only tested this on Solaris. We'll try out the bzero to see if
this goes away.

--td
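One way to test the zero-fill assumption directly on a given OS is a small stand-alone check. The sketch below is not Open MPI code: the function name, the `/tmp` path template, and the choice of `mkstemp()` are all assumptions made for illustration. It relies on the POSIX `ftruncate()` wording that an extended area "shall appear as if it were zero-filled", then maps the file with `MAP_SHARED` and counts any non-zero bytes.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical check (not Open MPI code): create a fresh file, extend it
 * with ftruncate(), map it shared, and count the non-zero bytes seen through
 * the mapping. Returns that count, or -1 if any system call fails. */
static long count_nonzero_after_extend(size_t len)
{
    char path[] = "/tmp/sm_zero_check_XXXXXX";   /* assumed writable location */
    unsigned char *p;
    long nonzero = 0;
    size_t i;
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);   /* the file goes away once fd is closed */

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    for (i = 0; i < len; i++) {
        if (p[i] != 0) {
            nonzero++;
        }
    }

    munmap(p, len);
    close(fd);
    return nonzero;
}
```

A result of 0 is what POSIX leads one to expect. A single-process check like this says nothing about what a second process sees while the first is still setting up the segment, so it cannot rule out the race by itself.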
Re: [OMPI devel] SM initialization race condition
Actually, bzero() is POSIX. Here is the history section of the bzero man page
on Mac OS X 10.4:

     A bzero() function appeared in 4.3BSD. Its prototype existed previously
     in <string.h> before it was moved to <strings.h> for IEEE Std 1003.1-2001
     (``POSIX.1'') compliance.

Hmmm, but the Linux man page says it is deprecated, and says we should
use memset. Wish it explained why... so I think we are safe to just use bzero.

On Thu, Aug 21, 2008 at 9:32 AM, Jeff Squyres wrote:
> IIRC, bzero is a gnu-ism. We should probably use memset instead.

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] SM initialization race condition
The bzero() function conforms to IEEE Std 1003.1-2001 (``POSIX.1'').
The memset() function conforms to ISO/IEC 9899:1990 (``ISO C90'').

Both functions are in libc, so it's definitely difficult to say which one
is better.

  george.

On Aug 21, 2008, at 3:32 PM, Jeff Squyres wrote:
> IIRC, bzero is a gnu-ism. We should probably use memset instead.
Re: [OMPI devel] SM initialization race condition
bzero is not a gnu-ism -- it's in POSIX.1. Either bzero or memset is
correct, and both are used throughout OMPI.

Brian

On Thu, 21 Aug 2008, Jeff Squyres wrote:
> IIRC, bzero is a gnu-ism. We should probably use memset instead.
Re: [OMPI devel] SM initialization race condition
IIRC, bzero is a gnu-ism. We should probably use memset instead.

On Aug 21, 2008, at 5:40 AM, George Bosilca wrote:
> Just in case, here is a patch that sets the beginning of the mmapped region
> to zero, in case this is not done automatically.
>
> +        bzero( map->data_addr, size );

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] SM initialization race condition
Terry,

We use the feature defined by POSIX mmap where the area should be
zero-filled when the file length is extended. What OS are you using when you
see such problems?

Just in case, here is a patch that sets the beginning of the mmapped region
to zero, in case this is not done automatically. As in most cases this is an
unnecessary overhead, we should find the cases where we really need this,
and possibly conditionally compile it.

Index: ompi/mca/common/sm/common_sm_mmap.c
===
--- ompi/mca/common/sm/common_sm_mmap.c (revision 19377)
+++ ompi/mca/common/sm/common_sm_mmap.c (working copy)
@@ -163,6 +163,7 @@

         /* initialize the segment - only the first process
            to open the file */
+        bzero( map->data_addr, size );
         mem_offset = map->data_addr - (unsigned char *)map->map_seg;
         map->map_seg->seg_offset = mem_offset;
         map->map_seg->seg_size = size - mem_offset;

  george.

On Aug 21, 2008, at 1:22 PM, Terry Dontje wrote:
> I've been seeing an intermittent (once every 4 hours looping on a quick
> initialization program) segv with the following stack trace.
>
> =>[1] mca_btl_sm_add_procs(btl = 0xfd7ffdb67ef0, nprocs = 2U, procs =
>       0x591560, peers = 0x591580, reachability = 0xfd7fffdff000), line 519
>       in "btl_sm.c"
>   [2] mca_bml_r2_add_procs(nprocs = 2U, procs = 0x591560, bml_endpoints =
>       0x591500, reachable = 0xfd7fffdff000), line 222 in "bml_r2.c"
>   [3] mca_pml_ob1_add_procs(procs = 0x5914c0, nprocs = 2U), line 248 in
>       "pml_ob1.c"
>   [4] ompi_mpi_init(argc = 1, argv = 0xfd7fffdff318, requested = 0,
>       provided = 0xfd7fffdff234), line 651 in "ompi_mpi_init.c"
>   [5] PMPI_Init(argc = 0xfd7fffdff2ec, argv = 0xfd7fffdff2e0), line 90 in
>       "pinit.c"
>   [6] main(argc = 1, argv = 0xfd7fffdff318), line 82 in "buffer.c"
>
> I believe the problem is that mca_btl_sm_component.shm_fifo[j] contains
> uninitialized data, which causes the loop on line 504 in btl_sm.c to think
> that a remote rank has set its fifo address.
>
> Has anyone else seen the above happening?
>
> --td
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
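If Terry's hypothesis is right, the failure mode is a reader treating leftover bytes in a shared slot as a published FIFO address. The sketch below illustrates the general "zero means not yet published" pattern that the zeroing patch relies on. It is not the btl_sm code: `fifo_table_t`, `SLOT_UNSET`, the fixed table size, and the use of C11 `<stdatomic.h>` are all assumptions chosen for illustration (the 2008 code base predates C11 atomics).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_UNSET ((uintptr_t)0)   /* sentinel: peer has not published yet */
#define TABLE_SLOTS 4               /* illustrative size only */

/* Hypothetical stand-in for a per-rank table of FIFO addresses that would
 * live in the shared segment. */
typedef struct {
    _Atomic uintptr_t fifo_addr[TABLE_SLOTS];
} fifo_table_t;

/* Run by the segment creator only, before any peer reads the table. Without
 * this step (or guaranteed zero-fill) a slot may hold stale non-zero data. */
static void table_init(fifo_table_t *t)
{
    size_t i;

    for (i = 0; i < TABLE_SLOTS; i++) {
        atomic_init(&t->fifo_addr[i], SLOT_UNSET);
    }
}

/* A rank publishes its own FIFO address once the FIFO is fully set up. */
static void table_publish(fifo_table_t *t, size_t rank, uintptr_t addr)
{
    atomic_store_explicit(&t->fifo_addr[rank], addr, memory_order_release);
}

/* Returns 1 and stores the address in *out if `rank` has published,
 * otherwise returns 0 and leaves *out untouched. */
static int table_try_get(fifo_table_t *t, size_t rank, uintptr_t *out)
{
    uintptr_t v = atomic_load_explicit(&t->fifo_addr[rank],
                                       memory_order_acquire);

    if (v == SLOT_UNSET) {
        return 0;
    }
    *out = v;
    return 1;
}
```

The pattern only holds if every slot is known to be `SLOT_UNSET` before the first reader polls it, which is exactly what is in doubt when the segment is not reliably zeroed.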