Good suggestion, Paul - I have committed it in r32407 and added it to cmr #4826

Thanks!
Ralph
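For reference, below is a minimal, self-contained sketch of the opal_pagesize() helper Paul floats in the quoted mail that follows. The helper name is only his suggestion (no such function exists yet), HAVE_GETPAGESIZE is assumed to be provided by configure as he notes, and the fallback chain simply mirrors his proposal; the code actually committed in r32407 may differ:

#include <unistd.h>

/* Return the system page size: prefer getpagesize(), then sysconf()
 * under either spelling of the constant, and fall back to a
 * deliberately large value when neither is available. */
static size_t opal_pagesize(void)
{
#if defined(HAVE_GETPAGESIZE)
    return (size_t) getpagesize();
#elif defined(_SC_PAGESIZE)
    return (size_t) sysconf(_SC_PAGESIZE);
#elif defined(_SC_PAGE_SIZE)
    return (size_t) sysconf(_SC_PAGE_SIZE);
#else
    return 65536;   /* safer to overestimate than underestimate */
#endif
}

Callers such as osc/sm and coll/ml could then share one definition instead of hard-coding 4096.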
On Aug 1, 2014, at 1:12 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Gilles,
>
> At the moment ompi/mca/osc/sm/osc_sm_component.c is using the following:
>
> #ifdef HAVE_GETPAGESIZE
> pagesize = getpagesize();
> #else
> pagesize = 4096;
> #endif
>
> Other places in the code use sysconf(), but not always consistently.
>
> And on some systems _SC_PAGESIZE is spelled as _SC_PAGE_SIZE.
> Fortunately, configure already checks these variations for you.
>
> So, I suggest:
>
> #ifdef HAVE_GETPAGESIZE
> pagesize = getpagesize();
> #elif defined(_SC_PAGESIZE)
> pagesize = sysconf(_SC_PAGESIZE);
> #elif defined(_SC_PAGE_SIZE)
> pagesize = sysconf(_SC_PAGE_SIZE);
> #else
> pagesize = 65536; /* safer to overestimate than under */
> #endif
>
> opal_pagesize() anyone?
>
> -Paul
>
> On Fri, Aug 1, 2014 at 12:50 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
> Paul,
>
> You are absolutely right!
>
> In ompi/mca/coll/ml/coll_ml_lmngr.c at line 53,
> cm->lmngr_alignment is hard-coded to 4096.
>
> As a proof of concept, I hard-coded it to 65536 and now coll/ml works just fine.
>
> I will now write a patch that uses sysconf(_SC_PAGESIZE) instead.
>
> Cheers,
>
> Gilles
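A minimal, self-contained sketch of the change Gilles describes above - deriving the coll/ml list-manager alignment from the runtime page size instead of hard-coding 4096. The struct and field name are illustrative stand-ins taken from his message rather than the actual Open MPI definitions, and the committed patch may look different:

#include <stdio.h>
#include <unistd.h>

/* Illustrative stand-in for the real coll/ml component structure. */
struct lmngr_config {
    size_t lmngr_alignment;
};

int main(void)
{
    struct lmngr_config cm;
    long ps = sysconf(_SC_PAGESIZE);

    /* was: cm.lmngr_alignment = 4096;  -- too small on 64 KiB-page kernels */
    cm.lmngr_alignment = (ps > 0) ? (size_t) ps : 65536;

    printf("lmngr_alignment = %zu\n", cm.lmngr_alignment);
    return 0;
}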
> On 2014/08/01 15:56, Paul Hargrove wrote:
>> Hmm, maybe this has nothing to do with big-endian.
>> Below is a backtrace from ring_c on an IA64 platform (definitely little-endian) that looks very similar to me.
>>
>> It happens that sysconf(_SC_PAGESIZE) returns 64K on both of these systems.
>> So, I wonder if that might be related.
>>
>> -Paul
>>
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>> [altix][[26769,1],0][/eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/openmpi-1.9a1r32386/ompi/mca/coll/ml/coll_ml_lmngr.c:231:mca_coll_ml_lmngr_init]
>> COLL-ML [altix:20418] *** Process received signal ***
>> [altix:20418] Signal: Segmentation fault (11)
>> [altix:20418] Signal code: Invalid permissions (2)
>> [altix:20418] Failing at address: 0x16
>> [altix][[26769,1],1][/eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/openmpi-1.9a1r32386/ompi/mca/coll/ml/coll_ml_lmngr.c:231:mca_coll_ml_lmngr_init]
>> COLL-ML [altix:20419] *** Process received signal ***
>> [altix:20419] Signal: Segmentation fault (11)
>> [altix:20419] Signal code: Invalid permissions (2)
>> [altix:20419] Failing at address: 0x16
>> [altix:20418] [ 0] [0xa000000000010800]
>> [altix:20418] [ 1] /lib/libc.so.6.1(strlen-0x92e930)[0x200000000051b2a0]
>> [altix:20418] [altix:20419] [ 0] [0xa000000000010800]
>> [altix:20419] [ 1] [ 2]
>> /lib/libc.so.6.1(strlen-0x92e930)[0x200000000051b2a0]
>> [altix:20419] [ 2]
>> /lib/libc.so.6.1(_IO_vfprintf-0x998610)[0x20000000004b15d0]
>> [altix:20419] [ 3] /lib/libc.so.6.1(+0x82860)[0x20000000004b2860]
>> [altix:20419] [ 4]
>> /lib/libc.so.6.1(_IO_vfprintf-0x99f140)[0x20000000004aaaa0]
>> [altix:20419] [ 5]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/openmpi/mca_coll_ml.so(+0xc5a70)[0x2000000001e55a70]
>> [altix:20419] [ 6]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/openmpi/mca_coll_ml.so(+0xc84a0)[0x2000000001e584a0]
>> [altix:20419] [ 7]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_lmngr_alloc+0x100f520)[0x2000000001e59110]
>> [altix:20419] [ 8]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_allocate_block+0xf6e940)[0x2000000001db8540]
>> [altix:20419] [ 9]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/openmpi/mca_coll_ml.so(+0x10130)[0x2000000001da0130]
>> [altix:20419] [10]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/openmpi/mca_coll_ml.so(+0x19970)[0x2000000001da9970]
>> [altix:20419] [11]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_comm_query+0xf6d6b0)[0x2000000001db5830]
>> [altix:20419] [12]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/libmpi.so.0(+0x22fbd0)[0x200000000028fbd0]
>> [altix:20419] [13]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/libmpi.so.0(+0x22fac0)[0x200000000028fac0]
>> [altix:20419] [14]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/libmpi.so.0(+0x22f7e0)[0x200000000028f7e0]
>> [altix:20419] [15]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/libmpi.so.0(+0x22eac0)[0x200000000028eac0]
>> [altix:20419] [16]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/libmpi.so.0(mca_coll_base_comm_select-0xbcbb90)[0x200000000027e080]
>> [altix:20419] [17]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/libmpi.so.0(ompi_mpi_init-0xd38e70)[0x2000000000110db0]
>> [altix:20419] [18]
>> /eng/home/PHHargrove/OMPI/openmpi-trunk-linux-ia64/INST/lib/libmpi.so.0(MPI_Init-0xc8bf40)[0x20000000001bdcf0]
>> [altix:20419] [19] examples/ring_c[0x4000000000000c00]
>> [altix:20419] [20]
>> /lib/libc.so.6.1(__libc_start_main-0x9f56b0)[0x2000000000454590]
>> [altix:20419] [21] examples/ring_c[0x4000000000000a20]
>> [altix:20419] *** End of error message ***
>> /lib/libc.so.6.1(_IO_vfprintf-0x998610)[0x20000000004b15d0]
>> [altix:20418] [ 3] /lib/libc.so.6.1(+0x82860)[0x20000000004b2860]
>> [altix:20418] [ 4]
>> /lib/libc.so.6.1(_IO_vfprintf-0x99f140)[0x20000000004aaaa0]
>>
>> On Thu, Jul 31, 2014 at 11:47 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>>> Gilles's findings are consistent with mine, which showed the SEGVs to be in the coll/ml code.
>>> I've built with --enable-debug, so below is a backtrace (well, two actually) that might be helpful.
>>> Unfortunately the output of the two ranks did get slightly entangled.
>>>
>>> -Paul
>>> $ ../INST/bin/mpirun -mca btl sm,self -np 2 examples/ring_c
>>> [bd-login][[43502,1],0][/home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/openmpi-1.9a1r32369/ompi/mca/coll/ml/coll_ml_lmngr.c:231:mca_coll_ml_lmngr_init]
>>> COLL-ML [bd-login:09106] *** Process received signal ***
>>> [bd-login][[43502,1],1][/home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/openmpi-1.9a1r32369/ompi/mca/coll/ml/coll_ml_lmngr.c:231:mca_coll_ml_lmngr_init]
>>> COLL-ML [bd-login:09107] *** Process received signal ***
>>> [bd-login:09107] Signal: Segmentation fault (11)
>>> [bd-login:09107] Signal code: Address not mapped (1)
>>> [bd-login:09107] Failing at address: 0x10
>>> [bd-login:09107] [ 0] [bd-login:09106] Signal: Segmentation fault (11)
>>> [bd-login:09106] Signal code: Address not mapped (1)
>>> [bd-login:09106] Failing at address: 0x10
>>> [bd-login:09106] [ 0] [0xfffa96c0418]
>>> [bd-login:09106] [ 1] [0xfff8f580418]
>>> [bd-login:09107] [ 1] /lib64/libc.so.6(_IO_vfprintf-0x157168)[0x80c9b5b968]
>>> [bd-login:09107] [ 2] /lib64/libc.so.6(_IO_vfprintf-0x157168)[0x80c9b5b968]
>>> [bd-login:09106] [ 2] /lib64/libc.so.6[0x80c9b600b4]
>>> [bd-login:09106] [ 3] /lib64/libc.so.6[0x80c9b600b4]
>>> [bd-login:09107] [ 3] /lib64/libc.so.6(_IO_vfprintf-0x157010)[0x80c9b5bac0]
>>> [bd-login:09107] [ 4] /lib64/libc.so.6(_IO_vfprintf-0x157010)[0x80c9b5bac0]
>>> [bd-login:09106] [ 4]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x66580)[0xfffa8296580]
>>> [bd-login:09106] [ 5]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x67604)[0xfffa8297604]
>>> [bd-login:09106] [ 6]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_lmngr_alloc-0x1af04)[0xfffa829784c]
>>> [bd-login:09106] [ 7]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_allocate_block-0x607b4)[0xfffa8250d4c]
>>> [bd-login:09106]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x66580)[0xfff8e156580]
>>> [bd-login:09107] [ 5]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x67604)[0xfff8e157604]
>>> [bd-login:09107] [ 6]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_lmngr_alloc-0x1af04)[0xfff8e15784c]
>>> [bd-login:09107] [ 7]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_allocate_block-0x607b4)[0xfff8e110d4c]
>>> [bd-login:09107] [ 8]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x165e4)[0xfff8e1065e4]
>>> [bd-login:09107] [ 9]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x1a7d8)[0xfff8e10a7d8]
>>> [bd-login:09107] [10]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_comm_query-0x61b50)[0xfff8e10f970]
>>> [bd-login:09107] [11] [ 8]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x165e4)[0xfffa82465e4]
>>> [bd-login:09106] [ 9]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(+0x1a7d8)[0xfffa824a7d8]
>>> [bd-login:09106] [10]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/openmpi/mca_coll_ml.so(mca_coll_ml_comm_query-0x61b50)[0xfffa824f970]
>>> [bd-login:09106] [11]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x165ba0)[0xfff8f4b5ba0]
>>> [bd-login:09107] [12]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x165b14)[0xfff8f4b5b14]
>>> [bd-login:09107] [13]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x165ba0)[0xfffa95f5ba0]
>>> [bd-login:09106] [12]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x165b14)[0xfffa95f5b14]
>>> [bd-login:09106] [13]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x1659a8)[0xfffa95f59a8]
>>> [bd-login:09106] [14]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x1659a8)[0xfff8f4b59a8]
>>> [bd-login:09107] [14]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x1657ac)[0xfffa95f57ac]
>>> [bd-login:09106] [15]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(+0x1657ac)[0xfff8f4b57ac]
>>> [bd-login:09107] [15]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(mca_coll_base_comm_select-0x9b89c)[0xfff8f4ae3ec]
>>> [bd-login:09107]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(mca_coll_base_comm_select-0x9b89c)[0xfffa95ee3ec]
>>> [bd-login:09106] [16] [16]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(ompi_mpi_init-0x13f790)[0xfff8f401408]
>>> [bd-login:09107] [17]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(ompi_mpi_init-0x13f790)[0xfffa9541408]
>>> [bd-login:09106] [17]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(MPI_Init-0xf28d4)[0xfffa9591c74]
>>> [bd-login:09106] [18] examples/ring_c[0x1000099c]
>>> [bd-login:09106] [19]
>>> /home/hargrov1/OMPI/openmpi-trunk-linux-ppc64-gcc/INST/lib/libmpi.so.0(MPI_Init-0xf28d4)[0xfff8f451c74]
>>> [bd-login:09107] [18] examples/ring_c[0x1000099c]
>>> [bd-login:09107] [19] /lib64/libc.so.6[0x80c9b2bcd8]
>>> [bd-login:09107] [20] /lib64/libc.so.6[0x80c9b2bcd8]
>>> [bd-login:09106] [20]
>>> /lib64/libc.so.6(__libc_start_main-0x184e00)[0x80c9b2bed0]
>>> [bd-login:09107] *** End of error message ***
>>> /lib64/libc.so.6(__libc_start_main-0x184e00)[0x80c9b2bed0]
>>> [bd-login:09106] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 0 on node bd-login exited on
>>> signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> On Thu, Jul 31, 2014 at 11:39 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Paul and Ralph,
>>>>
>>>> For what it's worth:
>>>>
>>>> a) I faced the very same issue on my (sloooow) qemu-emulated ppc64 VM.
>>>> b) I was able to run very basic programs when passing --mca coll ^ml to mpirun.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
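For anyone who hits this before a fix lands, the workaround Gilles describes above amounts to excluding the coll/ml component on the mpirun command line, for example (program and BTL selection taken from the runs shown elsewhere in this thread):

mpirun --mca coll ^ml -mca btl sm,self -np 2 examples/ring_c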
>>>>
>>>> On 2014/08/01 12:30, Ralph Castain wrote:
>>>>
>>>> Yes, I fear this will require some effort to chase all the breakage down, given that (to my knowledge, at least) we lack PPC machines in the devel group.
>>>>
>>>> On Jul 31, 2014, at 5:46 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>> On the path to verifying George's atomics patch, I have started just by verifying that I can still build the UNPATCHED trunk on each of the platforms I listed.
>>>>
>>>> I have tried two PPC64/Linux systems so far and am seeing the same problem on both. Though I can pass "make check", both platforms SEGV on
>>>> mpirun -mca btl sm,self -np 2 examples/ring_c
>>>>
>>>> Is this the expected state of the trunk on big-endian systems?
>>>> I am thinking in particular of http://www.open-mpi.org/community/lists/devel/2014/07/15365.php, in which Ralph wrote:
>>>>
>>>> Yeah, my fix won't work for big-endian machines - this is going to be an issue across the code base now, so we'll have to troll and fix it. I was doing the minimal change required to fix the trunk in the meantime.
>>>>
>>>> If this big-endian failure is not known/expected, let me know and I'll provide details.
>>>> Since testing George's patch only requires "make check", I can proceed with that regardless.
>>>>
>>>> -Paul
>>>>
>>>> On Thu, Jul 31, 2014 at 4:25 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> Awesome, thanks Paul. When the results are in, we will fix whatever is needed for these less common architectures.
>>>>
>>>> George.
>>>>
>>>> On Thu, Jul 31, 2014 at 7:24 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>> On Thu, Jul 31, 2014 at 4:22 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>> On Thu, Jul 31, 2014 at 4:13 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> Paul, I know you have a pretty diverse range of computers. Can you try to compile and run a "make check" with the following patch?
>>>>
>>>> I will see what I can do for ARMv7, MIPS, PPC and IA64 (or whatever subset of those is still supported).
>>>> The ARM and MIPS systems are emulators and take forever to build OMPI.
>>>> However, I am not even sure how soon I'll get to start this testing.
>>>>
>>>> Add SPARC (v8plus and v9) to that list.
>>>>
>>>> --
>>>> Paul H. Hargrove phhargr...@lbl.gov
>>>> Future Technologies Group
>>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15411.php
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15412.php
>>>>
>>>> --
>>>> Paul H. Hargrove phhargr...@lbl.gov
>>>> Future Technologies Group
>>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15414.php
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/07/15425.php
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15436.php
>>>
>>> --
>>> Paul H. Hargrove phhargr...@lbl.gov
>>> Future Technologies Group
>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15438.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15439.php
>
> --
> Paul H. Hargrove phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/08/15441.php