Hi guys - I'm awake now and will take a look at this. Thanks.

Ralph
On Jun 2, 2014, at 6:34 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:

> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>
> Program terminated with signal 11, Segmentation fault.
> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
> 607         opal_event_evtimer_del(timer->ev);
> Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libpciaccess-0.13.1-2.el6.x86_64 numactl-2.0.7-6.el6.x86_64
> (gdb) bt
> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
> #1  0x00007ffff7b1076c in event_process_active_single_queue (base=0x630d30, flags=<value optimized out>) at event.c:1367
> #2  event_process_active (base=0x630d30, flags=<value optimized out>) at event.c:1437
> #3  opal_libevent2021_event_base_loop (base=0x630d30, flags=<value optimized out>) at event.c:1645
> #4  0x000000000040501d in orterun (argc=10, argv=0x7fffffffe208) at orterun.c:1080
> #5  0x00000000004039e4 in main (argc=10, argv=0x7fffffffe208) at main.c:13
>
> On Mon, Jun 2, 2014 at 3:31 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> OK,
>
> please send me a clean gdb backtrace:
>
>     ulimit -c unlimited
>     mpirun ...        # this should generate a core
>     gdb mpirun core...
>     bt
>
> if there is no core:
>
>     gdb mpirun
>     r -np ... --mca ... ...
>     bt                # after the crash
>
> otherwise i can only review the code and hope i can find the root cause of an error i am unable to reproduce in my environment.
>
> Cheers,
>
> Gilles
>
> On Mon, Jun 2, 2014 at 9:03 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> Hi,
> Jenkins took your commit and applied it automatically; I tried with the mca flag later.
> Also, we don't have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor on our system - the cpuspeed daemon is off by default on all our nodes.
>
> Regards
> M
>
> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Mike,
>
> did you apply the patch *and* run with mpirun --mca rtc_freq_priority 0 ?
>
> *both* are required (--mca rtc_freq_priority 0 is not enough without the patch)
>
> can you please confirm there is no /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor (pseudo) file on your system ?
>
> if this still does not work for you, then this might be a different issue that i was unable to reproduce.
> in this case, could you run mpirun under gdb and send a gdb stack trace ?
>
> Cheers,
>
> Gilles
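A note on the backtrace Mike posted at the top: the fault is on opal_event_evtimer_del(timer->ev) at plm_base_launch_support.c:607, and the "Failing at address: 0x20" in his report further down looks like a NULL timer being dereferenced at the offset of its ev member. I have not verified this against the ORTE source, so treat the fragment below only as a sketch of the defensive shape such a guard could take - the timer type, its layout, and the stub functions are invented stand-ins to keep the example self-contained, not the real ORTE/libevent definitions.

#include <stddef.h>
#include <stdio.h>

/* Stand-in types/stubs so the sketch compiles on its own; the real
 * definitions live in ORTE and libevent and are NOT reproduced here. */
typedef struct { int id; } fake_event_t;
typedef struct {
    char pad[0x20];     /* padding so 'ev' sits at offset 0x20, matching the fault address */
    fake_event_t *ev;
} fake_timer_t;

static void fake_evtimer_del(fake_event_t *ev) { (void)ev; }

/* Defensive shape: validate cbdata/timer before touching timer->ev.
 * With a NULL timer, dereferencing timer->ev would read address 0x20. */
static void post_launch_sketch(int fd, short args, void *cbdata)
{
    (void)fd; (void)args;
    fake_timer_t *timer = (fake_timer_t *)cbdata;
    if (NULL == timer || NULL == timer->ev) {
        fprintf(stderr, "post_launch: no pending timer, skipping evtimer_del\n");
        return;
    }
    fake_evtimer_del(timer->ev);
}

int main(void)
{
    post_launch_sketch(-1, 0, NULL);   /* the suspected NULL-cbdata case, handled safely */
    return 0;
}

Whether that hypothesis matches the actual bug is something the core file should confirm one way or the other.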
> On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> more info: specifying --mca rtc_freq_priority 0 explicitly generates a different kind of failure:
>
> $ /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
> [vegas12:13887] *** Process received signal ***
> [vegas12:13887] Signal: Segmentation fault (11)
> [vegas12:13887] Signal code: Address not mapped (1)
> [vegas12:13887] Failing at address: 0x20
> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
> [vegas12:13887] [ 1] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
> [vegas12:13887] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
> [vegas12:13887] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
> [vegas12:13887] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
> [vegas12:13887] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
> [vegas12:13887] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
> [vegas12:13887] *** End of error message ***
> Segmentation fault (core dumped)
>
> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> Hi,
> This fix ("orte_rtc_base_select: skip a RTC module if it has a zero priority") did not help and jenkins still fails as before.
> The ompi was configured with:
> --with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check --enable-picky --with-knem --with-mxm --with-fca
>
> The run was on a single node:
>
> $ /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
> [vegas12:13834] *** Process received signal ***
> [vegas12:13834] Signal: Segmentation fault (11)
> [vegas12:13834] Signal code: Address not mapped (1)
> [vegas12:13834] Failing at address: (nil)
> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
> [vegas12:13834] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x7ffff41f5f3f]
> [vegas12:13834] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)[0x7ffff41f679b]
> [vegas12:13834] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_rtc_base_select+0xe6)[0x7ffff7ddc036]
> [vegas12:13834] [ 5] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_ess_hnp.so(+0x4056)[0x7ffff725b056]
> [vegas12:13834] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_init+0x174)[0x7ffff7d97254]
> [vegas12:13834] [ 7] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x863)[0x404613]
> [vegas12:13834] [ 8] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
> [vegas12:13834] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
> [vegas12:13834] [10] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
> [vegas12:13834] *** End of error message ***
> Segmentation fault (core dumped)
>
> On Mon, Jun 2, 2014 at 10:19 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> Mike and Ralph,
>
> i could not find a simple workaround.
>
> for the time being, i committed r31926 and invite those who face a similar issue to use the following workaround:
>
>     export OMPI_MCA_rtc_freq_priority=0
>     # or: mpirun --mca rtc_freq_priority 0 ...
>
> Cheers,
>
> Gilles
>
> On Mon, Jun 2, 2014 at 3:45 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> in orte/mca/rtc/freq/rtc_freq.c at line 187:
>
>     fp = fopen(filename, "r");
>
> where filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor".
>
> there is no error check, so if fp is NULL, orte_getline() will call fgets() on that NULL stream and crash.
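To see that failure mode in isolation, here is a minimal standalone sketch that mirrors the pattern Gilles describes: open the scaling_governor pseudo-file and read one line from it. The getline_dup() helper is an illustrative stand-in for orte_getline(), not the real Open MPI code; the only point is the NULL check after fopen(), which is what rtc_freq.c is missing. On a node with no cpufreq entries (cpuspeed disabled, as on Mike's nodes), handing fgets() the NULL stream is exactly the crash in frame [1] of the trace above.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for orte_getline(): returns a malloc'ed copy of the
 * next line (newline stripped), or NULL on EOF/error. Like the real helper,
 * it assumes 'fp' is a valid stream, so the caller must check fopen(). */
static char *getline_dup(FILE *fp)
{
    char buf[1024];
    if (NULL == fgets(buf, sizeof(buf), fp)) {
        return NULL;
    }
    buf[strcspn(buf, "\n")] = '\0';
    return strdup(buf);
}

int main(void)
{
    const char *filename = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";

    FILE *fp = fopen(filename, "r");
    if (NULL == fp) {
        /* This is the check being discussed: on systems where the cpufreq
         * governor file does not exist, fopen() returns NULL and a blind
         * fgets() on that NULL stream segfaults. */
        fprintf(stderr, "cannot open %s - skipping frequency support\n", filename);
        return 0;
    }

    char *governor = getline_dup(fp);
    fclose(fp);

    if (NULL != governor) {
        printf("scaling_governor: %s\n", governor);
        free(governor);
    }
    return 0;
}

Any equivalent guard in the component (or disabling it via the rtc_freq_priority workaround above) should avoid the crash on such nodes.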