OK, please send me a clean gdb backtrace:

ulimit -c unlimited
/* this should generate a core */
mpirun ...
gdb mpirun core...
bt
If there is no core file, run mpirun under gdb instead:

gdb mpirun
r -np ... --mca ... ...

and after the crash

bt

Failing that, I can only review the code and hope I can find the root cause of an error I am unable to reproduce in my environment.

Cheers,

Gilles


On Mon, Jun 2, 2014 at 9:03 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> Hi,
> Jenkins took your commit and applied it automatically; I tried with the mca flag later.
> Also, we don't have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> on our system; the cpuspeed daemon is off by default on all our nodes.
>
> Regards
> M
>
> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
>> Mike,
>>
>> did you apply the patch *and* run with mpirun --mca rtc_freq_priority 0 ?
>>
>> *both* are required (--mca rtc_freq_priority 0 is not enough without the patch)
>>
>> can you please confirm there is no
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>> (pseudo) file on your system ?
>>
>> if this still does not work for you, then this might be a different issue
>> that I was unable to reproduce.
>> in that case, could you run mpirun under gdb and send a gdb stack trace ?
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>
>>> More info: specifying --mca rtc_freq_priority 0 explicitly generates a
>>> different kind of failure:
>>>
>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>> [vegas12:13887] *** Process received signal ***
>>> [vegas12:13887] Signal: Segmentation fault (11)
>>> [vegas12:13887] Signal code: Address not mapped (1)
>>> [vegas12:13887] Failing at address: 0x20
>>> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>> [vegas12:13887] [ 1] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
>>> [vegas12:13887] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
>>> [vegas12:13887] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>>> [vegas12:13887] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>> [vegas12:13887] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>> [vegas12:13887] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>> [vegas12:13887] *** End of error message ***
>>> Segmentation fault (core dumped)
>>>
>>>
>>> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>
>>>> Hi,
>>>> This fix ("orte_rtc_base_select: skip a RTC module if it has a zero priority") did not help, and jenkins still fails as before.
>>>> The ompi was configured with:
>>>> --with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check --enable-picky --with-knem --with-mxm --with-fca
>>>>
>>>> The run was on a single node:
>>>>
>>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>>> [vegas12:13834] *** Process received signal ***
>>>> [vegas12:13834] Signal: Segmentation fault (11)
>>>> [vegas12:13834] Signal code: Address not mapped (1)
>>>> [vegas12:13834] Failing at address: (nil)
>>>> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>>> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
>>>> [vegas12:13834] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x7ffff41f5f3f]
>>>> [vegas12:13834] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)[0x7ffff41f679b]
>>>> [vegas12:13834] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_rtc_base_select+0xe6)[0x7ffff7ddc036]
>>>> [vegas12:13834] [ 5] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_ess_hnp.so(+0x4056)[0x7ffff725b056]
>>>> [vegas12:13834] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_init+0x174)[0x7ffff7d97254]
>>>> [vegas12:13834] [ 7] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x863)[0x404613]
>>>> [vegas12:13834] [ 8] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>>> [vegas12:13834] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>>> [vegas12:13834] [10] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>>> [vegas12:13834] *** End of error message ***
>>>> Segmentation fault (core dumped)
>>>>
>>>>
>>>> On Mon, Jun 2, 2014 at 10:19 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>
>>>>> Mike and Ralph,
>>>>>
>>>>> I could not find a simple workaround.
>>>>>
>>>>> For the time being, I committed r31926 and invite those who face a
>>>>> similar issue to use the following workaround:
>>>>> export OMPI_MCA_rtc_freq_priority=0
>>>>> /* or mpirun --mca rtc_freq_priority 0 ... */
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>>
>>>>> On Mon, Jun 2, 2014 at 3:45 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>>> In orte/mca/rtc/freq/rtc_freq.c at line 187:
>>>>>> fp = fopen(filename, "r");
>>>>>> where filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor".
>>>>>>
>>>>>> There is no error check, so if fp is NULL, orte_getline() will call
>>>>>> fgets(), which will crash.
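For reference, a minimal standalone sketch of the kind of guard that is missing around that fopen() call. This is not the actual Open MPI source; read_governor(), the buffer size, and the main() driver below are illustrative only, but the pattern is the one described above: bail out cleanly when the cpufreq sysfs file does not exist instead of handing a NULL FILE* to fgets().

/*
 * Hypothetical sketch: read the current cpufreq governor, or return NULL
 * if the pseudo-file is absent (no cpufreq support / cpuspeed disabled).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return a malloc'ed copy of the first line of 'path', or NULL if the
 * file cannot be opened or read. */
static char *read_governor(const char *path)
{
    char buf[256];
    FILE *fp = fopen(path, "r");

    if (NULL == fp) {
        /* the missing check: without it, fgets(..., NULL) segfaults */
        return NULL;
    }
    if (NULL == fgets(buf, sizeof(buf), fp)) {
        fclose(fp);
        return NULL;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
    return strdup(buf);
}

int main(void)
{
    char *gov = read_governor("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    if (NULL == gov) {
        printf("no scaling_governor file - the freq component should deselect itself\n");
        return 0;
    }
    printf("current governor: %s\n", gov);
    free(gov);
    return 0;
}

On a node without /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor this prints the "should deselect itself" message instead of crashing, which is the behaviour the rtc_freq_priority 0 workaround approximates from the outside.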