You're still missing a commit that fixed this problem

On Jun 2, 2014, at 9:44 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:

> Jenkins still failed (it hung and was killed by the 3m timeout) as shown below. No 
> env. MCA params were used.
> 
> Changes:
> Revert r31926 and replace it with a more complete checking of availability 
> and accessibility of the required freq control paths. 
> Break the loop caused by retrying to send a message to a hop that is unknown 
> by the TCP oob component. We attempt to provide a way for other components to 
> try, but need to mark that the TCP component is not able to reach that 
> process so the OOB base will know to give up.
> 
> 
> 19:36:19 + timeout -s SIGSEGV 3m 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>  -np 8 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
> 19:36:19 [vegas12:03383] *** Process received signal ***
> 19:36:19 [vegas12:03383] Signal: Segmentation fault (11)
> 19:36:19 [vegas12:03383] Signal code: Address not mapped (1)
> 19:36:19 [vegas12:03383] Failing at address: 0x20
> 19:36:19 [vegas12:03383] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
> 19:36:19 [vegas12:03383] [ 1] 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
> 19:36:19 [vegas12:03383] [ 2] 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
> 19:36:19 [vegas12:03383] [ 3] 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
> 19:36:19 [vegas12:03383] [ 4] 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
> 19:36:19 [vegas12:03383] [ 5] 
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
> 19:36:19 [vegas12:03383] [ 6] 
> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
> 19:36:19 [vegas12:03383] *** End of error message ***
> 19:36:20 Build step 'Execute shell' marked build as failure
> 19:36:21 [BFA] Scanning build for known causes...
> 
> 
> On Mon, Jun 2, 2014 at 7:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
> I fixed this - the key was that it would only happen if the MCA param 
> orte_startup_timeout was set.
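> 
> For illustration only (the actual fix commit is not quoted in this thread): 
> the crash site at plm_base_launch_support.c:607 deletes the startup timer's 
> event unconditionally, so a hypothetical defensive guard along these lines 
> would avoid the segfault, although the real change may differ:
> 
>     /* hypothetical sketch, not the actual commit: skip the teardown
>      * when the timer object or its event pointer is NULL */
>     if (NULL != timer && NULL != timer->ev) {
>         opal_event_evtimer_del(timer->ev);
>     }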
> 
> It really does help, folks, if you include info on what MCA params were set 
> when you get these failures. Otherwise, it is impossible to replicate the 
> problem.
> 
> 
> On Jun 2, 2014, at 6:49 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> Hi guys
>> 
>> I'm awake now and will take a look at this - thanks
>> Ralph
>> 
>> On Jun 2, 2014, at 6:34 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>> 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>>  -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>> Program terminated with signal 11, Segmentation fault.
>>> 
>>> 
>>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value 
>>> optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>>> 607             opal_event_evtimer_del(timer->ev);
>>> Missing separate debuginfos, use: debuginfo-install 
>>> glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 
>>> libpciaccess-0.13.1-2.el6.x86_64 numactl-2.0.7-6.el6.x86_64
>>> (gdb) bt
>>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value 
>>> optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>>> #1  0x00007ffff7b1076c in event_process_active_single_queue (base=0x630d30, 
>>> flags=<value optimized out>) at event.c:1367
>>> #2  event_process_active (base=0x630d30, flags=<value optimized out>) at 
>>> event.c:1437
>>> #3  opal_libevent2021_event_base_loop (base=0x630d30, flags=<value 
>>> optimized out>) at event.c:1645
>>> #4  0x000000000040501d in orterun (argc=10, argv=0x7fffffffe208) at 
>>> orterun.c:1080
>>> #5  0x00000000004039e4 in main (argc=10, argv=0x7fffffffe208) at main.c:13
>>> 
>>> 
>>> On Mon, Jun 2, 2014 at 3:31 PM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> OK,
>>> 
>>> Please send me a clean gdb backtrace:
>>> ulimit -c unlimited
>>> /* this should generate a core */
>>> mpirun ...
>>> gdb mpirun core...
>>> bt
>>> 
>>> if no core is generated:
>>> gdb mpirun
>>> r -np ... --mca ... ...
>>> and after the crash
>>> bt
>>> 
>>> Without a backtrace, I can only review the code and hope to find the root 
>>> cause of an error I am unable to reproduce in my environment.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Jun 2, 2014 at 9:03 PM, Mike Dubman <mi...@dev.mellanox.co.il> 
>>> wrote:
>>> Hi,
>>> Jenkins picked up your commit and applied it automatically; I tried with 
>>> the MCA flag later.
>>> Also, we don't have /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 
>>> on our system; the cpuspeed daemon is off by default on all our nodes.
>>> 
>>> 
>>> Regards
>>> M
>>> 
>>> 
>>> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> Mike,
>>> 
>>> did you apply the patch *and* run mpirun with --mca rtc_freq_priority 0?
>>> 
>>> *both* are required (--mca rtc_freq_priority 0 is not enough without the 
>>> patch)
>>> 
>>> can you please confirm there is no 
>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor (pseudo) file on your 
>>> system?
>>> 
>>> if this still does not work for you, then this might be a different issue 
>>> that I was unable to reproduce.
>>> In this case, could you run mpirun under gdb and send a gdb stack trace?
>>> 
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman <mi...@dev.mellanox.co.il> 
>>> wrote:
>>> More info: specifying --mca rtc_freq_priority 0 explicitly generates a 
>>> different kind of failure:
>>> 
>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>>  -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>> [vegas12:13887] *** Process received signal ***
>>> [vegas12:13887] Signal: Segmentation fault (11)
>>> [vegas12:13887] Signal code: Address not mapped (1)
>>> [vegas12:13887] Failing at address: 0x20
>>> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>> [vegas12:13887] [ 1] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
>>> [vegas12:13887] [ 2] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
>>> [vegas12:13887] [ 3] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>>> [vegas12:13887] [ 4] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>> [vegas12:13887] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>> [vegas12:13887] [ 6] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>> [vegas12:13887] *** End of error message ***
>>> Segmentation fault (core dumped)
>>> 
>>> 
>>> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman <mi...@dev.mellanox.co.il> 
>>> wrote:
>>> Hi,
>>> This fix "orte_rtc_base_select: skip a RTC module if it has a zero 
>>> priority" did not help and jenkins stilll fails as before.
>>> The ompi was configured:
>>> --with-platform=contrib/platform/mellanox/optimized --with-ompi-param-check 
>>> --enable-picky --with-knem --with-mxm --with-fca
>>> 
>>> The run was on a single node:
>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>>  -np 8 -mca btl sm,tcp 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>> [vegas12:13834] *** Process received signal ***
>>> [vegas12:13834] Signal: Segmentation fault (11)
>>> [vegas12:13834] Signal code: Address not mapped (1)
>>> [vegas12:13834] Failing at address: (nil)
>>> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
>>> [vegas12:13834] [ 2] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x7ffff41f5f3f]
>>> [vegas12:13834] [ 3] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)[0x7ffff41f679b]
>>> [vegas12:13834] [ 4] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_rtc_base_select+0xe6)[0x7ffff7ddc036]
>>> [vegas12:13834] [ 5] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_ess_hnp.so(+0x4056)[0x7ffff725b056]
>>> [vegas12:13834] [ 6] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_init+0x174)[0x7ffff7d97254]
>>> [vegas12:13834] [ 7] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x863)[0x404613]
>>> [vegas12:13834] [ 8] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>> [vegas12:13834] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>> [vegas12:13834] [10] 
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>> [vegas12:13834] *** End of error message ***
>>> Segmentation fault (core dumped)
>>> 
>>> 
>>> 
>>> On Mon, Jun 2, 2014 at 10:19 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> Mike and Ralph,
>>> 
>>> I could not find a simple workaround.
>>> 
>>> For the time being, I committed r31926 and invite those who face a similar 
>>> issue to use the following workaround:
>>> export OMPI_MCA_rtc_freq_priority=0
>>> /* or mpirun --mca rtc_freq_priority 0 ... */
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Jun 2, 2014 at 3:45 PM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>> In orte/mca/rtc/freq/rtc_freq.c at line 187:
>>> fp = fopen(filename, "r");
>>> where filename is "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor".
>>> 
>>> There is no error check, so if fp is NULL, orte_getline() will call fgets(), 
>>> which will crash.
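>>> 
>>> An illustrative sketch of the kind of check that is missing at that spot 
>>> (this is not the actual upstream fix, which per the changelog above does a 
>>> more complete availability/accessibility check of the freq control paths; 
>>> the error path shown here is hypothetical):
>>> 
>>> fp = fopen(filename, "r");
>>> if (NULL == fp) {
>>>     /* no cpufreq scaling_governor pseudo file on this node:
>>>      * do not hand a NULL stream to orte_getline()/fgets(),
>>>      * report the component as unusable instead */
>>>     return ORTE_ERR_NOT_FOUND;  /* hypothetical error path */
>>> }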
>>> 