With this fix - no failure. Thanks!
On Mon, Jun 2, 2014 at 8:52 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Yep, that's the one. Should have fixed that problem.
>
> On Jun 2, 2014, at 10:30 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>
> This one? "Fix typo that would cause a segfault if orte_startup_timeout
> was set"
> If so, it is still running.
>
> On Mon, Jun 2, 2014 at 8:16 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> You're still missing a commit that fixed this problem.
>>
>> On Jun 2, 2014, at 9:44 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>
>> Jenkins still failed (it hung and was killed by the timeout after 3m), as
>> shown below. No env MCA params were used.
>>
>> Changes:
>>
>> 1. Revert r31926 and replace it with a more complete check of the
>> availability and accessibility of the required freq control paths.
>> 2. Break the loop caused by retrying to send a message to a hop that is
>> unknown to the TCP oob component. We attempt to provide a way for other
>> components to try, but need to mark that the TCP component is not able to
>> reach that process so the OOB base will know to give up.
>>
>> 19:36:19 + timeout -s SIGSEGV 3m /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>> 19:36:19 [vegas12:03383] *** Process received signal ***
>> 19:36:19 [vegas12:03383] Signal: Segmentation fault (11)
>> 19:36:19 [vegas12:03383] Signal code: Address not mapped (1)
>> 19:36:19 [vegas12:03383] Failing at address: 0x20
>> 19:36:19 [vegas12:03383] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>> 19:36:19 [vegas12:03383] [ 1] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
>> 19:36:19 [vegas12:03383] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
>> 19:36:19 [vegas12:03383] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>> 19:36:19 [vegas12:03383] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>> 19:36:19 [vegas12:03383] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>> 19:36:19 [vegas12:03383] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>> 19:36:19 [vegas12:03383] *** End of error message ***
>> 19:36:20 Build step 'Execute shell' marked build as failure
>> 19:36:21 [BFA] Scanning build for known causes...
>>
>> On Mon, Jun 2, 2014 at 7:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> I fixed this - the key was that it would only happen if the MCA param
>>> orte_startup_timeout was set.
>>>
>>> It really does help, folks, if you include info on which MCA params were
>>> set when you get these failures. Otherwise, it is impossible to replicate
>>> the problem.
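The fault address 0x20 in the trace above is consistent with orte_plm_base_post_launch() reading a member of a NULL timer object, i.e. the startup-timeout timer that only exists when orte_startup_timeout is set. Below is a minimal standalone C sketch of that failure mode and the kind of guard that avoids it; the struct layout, member offset, and callback name are made up for illustration and are not the actual ORTE code.

    #include <stdio.h>

    /* Illustrative stand-in for the ORTE timer object; the real layout is
     * different, but any member at offset 0x20 reproduces the fault address. */
    typedef struct {
        char  padding[0x20];   /* members laid out before 'ev' */
        void *ev;              /* stand-in for the event read at line 607 */
    } fake_timer_t;

    /* Stand-in for orte_plm_base_post_launch(); cbdata carries the timer. */
    static void post_launch_cb(void *cbdata)
    {
        fake_timer_t *timer = (fake_timer_t *) cbdata;

        /* The guard: the startup timer exists only when orte_startup_timeout
         * is set, so check before dereferencing.  Without this, reading
         * timer->ev with timer == NULL faults at the member offset (0x20). */
        if (NULL == timer || NULL == timer->ev) {
            printf("no startup timer armed, nothing to cancel\n");
            return;
        }

        /* The real code would call opal_event_evtimer_del(timer->ev) here. */
        printf("cancelling startup timer event %p\n", timer->ev);
    }

    int main(void)
    {
        post_launch_cb(NULL);   /* the case that used to segfault */
        return 0;
    }

Whatever the actual typo was that Ralph fixed, the symptom matches this pattern: the callback fires with a timer that was never set up.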
>>> On Jun 2, 2014, at 6:49 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hi guys,
>>>
>>> I'm awake now and will take a look at this - thanks
>>> Ralph
>>>
>>> On Jun 2, 2014, at 6:34 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>> Program terminated with signal 11, Segmentation fault.
>>>
>>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>>> 607          opal_event_evtimer_del(timer->ev);
>>> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libpciaccess-0.13.1-2.el6.x86_64 numactl-2.0.7-6.el6.x86_64
>>> (gdb) bt
>>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>>> #1  0x00007ffff7b1076c in event_process_active_single_queue (base=0x630d30, flags=<value optimized out>) at event.c:1367
>>> #2  event_process_active (base=0x630d30, flags=<value optimized out>) at event.c:1437
>>> #3  opal_libevent2021_event_base_loop (base=0x630d30, flags=<value optimized out>) at event.c:1645
>>> #4  0x000000000040501d in orterun (argc=10, argv=0x7fffffffe208) at orterun.c:1080
>>> #5  0x00000000004039e4 in main (argc=10, argv=0x7fffffffe208) at main.c:13
>>>
>>> On Mon, Jun 2, 2014 at 3:31 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> OK,
>>>>
>>>> please send me a clean gdb backtrace:
>>>>   ulimit -c unlimited
>>>>   /* this should generate a core */
>>>>   mpirun ...
>>>>   gdb mpirun core...
>>>>   bt
>>>>
>>>> if there is no core:
>>>>   gdb mpirun
>>>>   r -np ... --mca ... ...
>>>>   and after the crash:
>>>>   bt
>>>>
>>>> Then I can only review the code and hope I can find the root cause of
>>>> the error I am unable to reproduce in my environment.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Mon, Jun 2, 2014 at 9:03 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>>
>>>>> Hi,
>>>>> Jenkins took your commit and applied it automatically; I tried with the
>>>>> MCA flag later.
>>>>> Also, we don't have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>>>>> on our system; the cpuspeed daemon is off by default on all our nodes.
>>>>>
>>>>> Regards
>>>>> M
>>>>>
>>>>> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> did you apply the patch *and* run mpirun --mca rtc_freq_priority 0 ?
>>>>>>
>>>>>> *both* are required (--mca rtc_freq_priority 0 is not enough without
>>>>>> the patch)
>>>>>>
>>>>>> can you please confirm there is no
>>>>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>>>>>> (pseudo) file on your system ?
>>>>>>
>>>>>> if this still does not work for you, then this might be a different
>>>>>> issue I was unable to reproduce.
>>>>>> in this case, could you run mpirun under gdb and send a gdb stack
>>>>>> trace ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
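Checking whether scaling_governor is even present is what change #1 earlier in the thread ("a more complete checking of availability and accessibility of the required freq control paths") describes. A minimal standalone C sketch of such a probe, assuming a made-up helper name and path list (this is not the actual Open MPI code):

    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical probe: verify that the cpufreq control files the freq RTC
     * component needs exist and are readable before the component is used. */
    static bool freq_paths_usable(void)
    {
        const char *paths[] = {
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq",
            NULL
        };

        for (int i = 0; NULL != paths[i]; i++) {
            /* access(R_OK) fails if the file is missing or unreadable; on
             * nodes where cpuspeed is off and scaling_governor is absent,
             * this returns false and the component can disqualify itself. */
            if (0 != access(paths[i], R_OK)) {
                return false;
            }
        }
        return true;
    }

    int main(void)
    {
        printf("freq control paths usable: %s\n",
               freq_paths_usable() ? "yes" : "no");
        return 0;
    }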
>>>>>> On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>>>>
>>>>>>> More info: specifying --mca rtc_freq_priority 0 explicitly generates a
>>>>>>> different kind of failure:
>>>>>>>
>>>>>>> $ /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>>>>>> [vegas12:13887] *** Process received signal ***
>>>>>>> [vegas12:13887] Signal: Segmentation fault (11)
>>>>>>> [vegas12:13887] Signal code: Address not mapped (1)
>>>>>>> [vegas12:13887] Failing at address: 0x20
>>>>>>> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>>>>>> [vegas12:13887] [ 1] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
>>>>>>> [vegas12:13887] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
>>>>>>> [vegas12:13887] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>>>>>>> [vegas12:13887] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>>>>>> [vegas12:13887] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>>>>>> [vegas12:13887] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>>>>>> [vegas12:13887] *** End of error message ***
>>>>>>> Segmentation fault (core dumped)
>>>>>>>
>>>>>>> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> This fix, "orte_rtc_base_select: skip a RTC module if it has a zero
>>>>>>>> priority", did not help and Jenkins still fails as before.
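For reference, the intent behind the fix Mike names above is that framework selection should never open a module whose reported priority is zero. A rough, hypothetical C sketch of that idea follows; the types and names are made up, and this is not the actual orte_rtc_base_select() code.

    #include <stdio.h>

    /* Toy model of a framework module with a priority and an init hook. */
    typedef struct {
        const char *name;
        int priority;          /* 0 means "do not use me" */
        int (*init)(void);     /* may touch /sys files, etc. */
    } rtc_module_t;

    static int freq_init(void)
    {
        printf("freq module initialized\n");
        return 0;
    }

    int main(void)
    {
        /* Priority forced to 0, e.g. by "--mca rtc_freq_priority 0". */
        rtc_module_t modules[] = { { "freq", 0, freq_init } };
        const int n = (int) (sizeof(modules) / sizeof(modules[0]));

        for (int i = 0; i < n; i++) {
            if (modules[i].priority <= 0) {
                /* The stated fix: a zero-priority module is skipped, so its
                 * init code (and any fopen/fgets in it) never runs. */
                printf("skipping RTC module '%s' (priority %d)\n",
                       modules[i].name, modules[i].priority);
                continue;
            }
            modules[i].init();
        }
        return 0;
    }

Whether the real selection code checks the priority before or after the module's query code runs is not visible from this thread; the sketch only shows the stated intent.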
>>>>>>>> Open MPI was configured with:
>>>>>>>> --with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check --enable-picky --with-knem --with-mxm --with-fca
>>>>>>>>
>>>>>>>> The run was on a single node:
>>>>>>>>
>>>>>>>> $ /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>>>>>>> [vegas12:13834] *** Process received signal ***
>>>>>>>> [vegas12:13834] Signal: Segmentation fault (11)
>>>>>>>> [vegas12:13834] Signal code: Address not mapped (1)
>>>>>>>> [vegas12:13834] Failing at address: (nil)
>>>>>>>> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>>>>>>> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
>>>>>>>> [vegas12:13834] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x7ffff41f5f3f]
>>>>>>>> [vegas12:13834] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)[0x7ffff41f679b]
>>>>>>>> [vegas12:13834] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_rtc_base_select+0xe6)[0x7ffff7ddc036]
>>>>>>>> [vegas12:13834] [ 5] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_ess_hnp.so(+0x4056)[0x7ffff725b056]
>>>>>>>> [vegas12:13834] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_init+0x174)[0x7ffff7d97254]
>>>>>>>> [vegas12:13834] [ 7] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x863)[0x404613]
>>>>>>>> [vegas12:13834] [ 8] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>>>>>>> [vegas12:13834] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>>>>>>> [vegas12:13834] [10] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>>>>>>> [vegas12:13834] *** End of error message ***
>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>
>>>>>>>> On Mon, Jun 2, 2014 at 10:19 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Mike and Ralph,
>>>>>>>>>
>>>>>>>>> I could not find a simple workaround.
>>>>>>>>>
>>>>>>>>> For the time being, I committed r31926 and invite those who face a
>>>>>>>>> similar issue to use the following workaround:
>>>>>>>>> export OMPI_MCA_rtc_freq_priority=0
>>>>>>>>> /* or mpirun --mca rtc_freq_priority 0 ... */
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On Mon, Jun 2, 2014 at 3:45 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> In orte/mca/rtc/freq/rtc_freq.c at line 187:
>>>>>>>>>>     fp = fopen(filename, "r");
>>>>>>>>>> and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor".
>>>>>>>>>> There is no error check, so if fp is NULL, orte_getline() will call
>>>>>>>>>> fgets(), which will crash.
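The backtrace above (fgets in frame [1], mca_rtc_freq.so in frames [2]-[3], orte_rtc_base_select in frame [4]) matches that explanation. A minimal standalone C sketch of the pattern and the missing NULL check, assuming nothing about rtc_freq.c beyond what is quoted (the helper name is made up):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical reader mirroring the pattern described above: open a sysfs
     * file and read its first line.  Returns 0 on success, -1 otherwise. */
    static int read_first_line(const char *filename, char *buf, size_t len)
    {
        FILE *fp = fopen(filename, "r");

        /* The missing error check: without it, fgets() is handed a NULL
         * stream and segfaults, which is the crash in frame [1] above. */
        if (NULL == fp) {
            return -1;
        }
        if (NULL == fgets(buf, (int) len, fp)) {
            fclose(fp);
            return -1;
        }
        fclose(fp);
        buf[strcspn(buf, "\n")] = '\0';   /* strip the trailing newline */
        return 0;
    }

    int main(void)
    {
        char governor[256];
        const char *path =
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";

        if (0 != read_first_line(path, governor, sizeof(governor))) {
            /* On nodes without cpufreq (cpuspeed disabled) this branch is
             * taken and the caller can disqualify the component instead of
             * crashing. */
            printf("no scaling_governor available on this node\n");
            return 0;
        }
        printf("current governor: %s\n", governor);
        return 0;
    }

Until a fix with that check is in place, the workaround Gilles gives (rtc_freq_priority=0) keeps the freq component from running this code at all.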