I fixed this; the key was that it would only happen if the MCA param 
orte_startup_timeout was set.

It really does help, folks, if you include info on what MCA params were set 
when you get these failures. Otherwise, it is impossible to replicate the 
problem.


On Jun 2, 2014, at 6:49 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Hi guys
> 
> I'm awake now and will take a look at this - thanks
> Ralph
> 
> On Jun 2, 2014, at 6:34 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>  -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>> 
>> Program terminated with signal 11, Segmentation fault.
>> 
>> 
>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value 
>> optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>> 607             opal_event_evtimer_del(timer->ev);
>> Missing separate debuginfos, use: debuginfo-install 
>> glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 
>> libpciaccess-0.13.1-2.el6.x86_64 numactl-2.0.7-6.el6.x86_64
>> (gdb) bt
>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value 
>> optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>> #1  0x00007ffff7b1076c in event_process_active_single_queue (base=0x630d30, 
>> flags=<value optimized out>) at event.c:1367
>> #2  event_process_active (base=0x630d30, flags=<value optimized out>) at 
>> event.c:1437
>> #3  opal_libevent2021_event_base_loop (base=0x630d30, flags=<value optimized 
>> out>) at event.c:1645
>> #4  0x000000000040501d in orterun (argc=10, argv=0x7fffffffe208) at 
>> orterun.c:1080
>> #5  0x00000000004039e4 in main (argc=10, argv=0x7fffffffe208) at main.c:13
>> 
>> 
>> On Mon, Jun 2, 2014 at 3:31 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> OK,
>> 
>> please send me a clean gdb backtrace:
>> ulimit -c unlimited
>> /* this should generate a core */
>> mpirun ...
>> gdb mpirun core...
>> bt
>> 
>> if there is no core:
>> gdb mpirun
>> r -np ... --mca ... ...
>> and after the crash:
>> bt
>> 
>> Otherwise I can only review the code and hope I can find the root cause of 
>> the error I am unable to reproduce in my environment.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> 
>> 
>> On Mon, Jun 2, 2014 at 9:03 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>> Hi,
>> Jenkins took your commit and applied it automatically; I tried with the MCA 
>> flag later.
>> Also, we don't have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor on 
>> our system; the cpuspeed daemon is off by default on all our nodes.
>> 
>> 
>> Regards
>> M
>> 
>> 
>> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> Mike,
>> 
>> did you apply the patch *and* mpirun --mca rtc_freq_priority 0 ?
>> 
>> *both* are required (--mca rtc_freq_priority 0 is not enough without the 
>> patch)
>> 
>> can you please confirm there is no 
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor (pseudo) file on your 
>> system ?
>> 
>> if this still does not work for you, then this might be a different issue I 
>> was unable to reproduce.
>> In that case, could you run mpirun under gdb and send a gdb stack trace?
>> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> 
>> 
>> On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>> More info: specifying --mca rtc_freq_priority 0 explicitly generates a 
>> different kind of failure:
>> 
>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>  -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>> [vegas12:13887] *** Process received signal ***
>> [vegas12:13887] Signal: Segmentation fault (11)
>> [vegas12:13887] Signal code: Address not mapped (1)
>> [vegas12:13887] Failing at address: 0x20
>> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>> [vegas12:13887] [ 1] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
>> [vegas12:13887] [ 2] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
>> [vegas12:13887] [ 3] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>> [vegas12:13887] [ 4] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>> [vegas12:13887] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>> [vegas12:13887] [ 6] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>> [vegas12:13887] *** End of error message ***
>> Segmentation fault (core dumped)
>> 
>> 
>> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>> Hi,
>> This fix ("orte_rtc_base_select: skip a RTC module if it has a zero 
>> priority") did not help; Jenkins still fails as before.
>> The OMPI build was configured with:
>> --with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check 
>> --enable-picky --with-knem --with-mxm --with-fca
>> 
>> The run was on single node:
>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>  -np 8 -mca btl sm,tcp 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>> [vegas12:13834] *** Process received signal ***
>> [vegas12:13834] Signal: Segmentation fault (11)
>> [vegas12:13834] Signal code: Address not mapped (1)
>> [vegas12:13834] Failing at address: (nil)
>> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
>> [vegas12:13834] [ 2] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x7ffff41f5f3f]
>> [vegas12:13834] [ 3] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)[0x7ffff41f679b]
>> [vegas12:13834] [ 4] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_rtc_base_select+0xe6)[0x7ffff7ddc036]
>> [vegas12:13834] [ 5] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_ess_hnp.so(+0x4056)[0x7ffff725b056]
>> [vegas12:13834] [ 6] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_init+0x174)[0x7ffff7d97254]
>> [vegas12:13834] [ 7] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x863)[0x404613]
>> [vegas12:13834] [ 8] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>> [vegas12:13834] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>> [vegas12:13834] [10] 
>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>> [vegas12:13834] *** End of error message ***
>> Segmentation fault (core dumped)
>> 
>> 
>> 
>> On Mon, Jun 2, 2014 at 10:19 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> Mike and Ralph,
>> 
>> I could not find a simple workaround.
>> 
>> For the time being, I committed r31926; those who face a similar issue can 
>> use the following workaround:
>> export OMPI_MCA_rtc_freq_priority=0
>> /* or mpirun --mca rtc_freq_priority 0 ... */
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> 
>> 
>> On Mon, Jun 2, 2014 at 3:45 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@gmail.com> wrote:
>> In orte/mca/rtc/freq/rtc_freq.c at line 187:
>> fp = fopen(filename, "r");
>> and filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
>> 
>> There is no error check, so if fp is NULL, orte_getline() will call fgets() 
>> on a NULL stream, which crashes.
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/06/14939.php
> 
