with this fix - no failure.
Thanks!

On Mon, Jun 2, 2014 at 8:52 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Yep, that's the one. Should have fixed that problem
>
>
> On Jun 2, 2014, at 10:30 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>
> This one? "Fix typo that would cause a segfault if orte_startup_timeout
> was set"
> If so, it is still running.
>
>
> On Mon, Jun 2, 2014 at 8:16 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> You're still missing a commit that fixed this problem
>>
>> On Jun 2, 2014, at 9:44 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>
>> The Jenkins job still failed (it hung and was killed by the timeout after 3m), as shown below.
>> No env MCA params were used.
>>
>> Changes:
>>
>>    1. Revert r31926 and replace it with a more complete checking of
>>    availability and accessibility of the required freq control paths.
>>    2. Break the loop caused by retrying to send a message to a hop that
>>    is unknown by the TCP oob component. We attempt to provide a way for other
>>    components to try, but need to mark that the TCP component is not able to
>>    reach that process so the OOB base will know to give up.
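>>
>> (A rough sketch of the behaviour described in change 2 above - the names
>> below, peer_t and tcp_send_to_hop, are made-up stand-ins and not the real
>> oob/tcp code: an unreachable hop is remembered so the send fails fast and
>> the OOB base can try another component or give up.)
>>
>> #include <stdio.h>
>> #include <stdbool.h>
>>
>> typedef struct { int vpid; bool unreachable; } peer_t;
>>
>> enum { TOY_SUCCESS = 0, TOY_ERR_UNREACH = -1 };
>>
>> /* Do not retry forever: mark the peer and report failure upward. */
>> static int tcp_send_to_hop(peer_t *peer, const char *msg)
>> {
>>     if (peer->unreachable) {
>>         return TOY_ERR_UNREACH;      /* already known to be unreachable */
>>     }
>>     bool route_known = false;        /* pretend no route to this hop exists */
>>     if (!route_known) {
>>         peer->unreachable = true;    /* remember it: no retry loop */
>>         return TOY_ERR_UNREACH;
>>     }
>>     printf("sent '%s' to vpid %d\n", msg, peer->vpid);
>>     return TOY_SUCCESS;
>> }
>>
>> int main(void)
>> {
>>     peer_t p = { .vpid = 3, .unreachable = false };
>>     if (TOY_SUCCESS != tcp_send_to_hop(&p, "hello")) {
>>         printf("TCP cannot reach vpid %d - let the OOB base decide\n", p.vpid);
>>     }
>>     return 0;
>> }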
>>
>>
>>
>> 19:36:19  + timeout -s SIGSEGV 3m /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>> 19:36:19  [vegas12:03383] *** Process received signal ***
>> 19:36:19  [vegas12:03383] Signal: Segmentation fault (11)
>> 19:36:19  [vegas12:03383] Signal code: Address not mapped (1)
>> 19:36:19  [vegas12:03383] Failing at address: 0x20
>> 19:36:19  [vegas12:03383] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>> 19:36:19  [vegas12:03383] [ 1] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
>> 19:36:19  [vegas12:03383] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
>> 19:36:19  [vegas12:03383] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>> 19:36:19  [vegas12:03383] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>> 19:36:19  [vegas12:03383] [ 5] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>> 19:36:19  [vegas12:03383] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>> 19:36:19  [vegas12:03383] *** End of error message ***
>> 19:36:20  Build step 'Execute shell' marked build as failure
>> 19:36:21  [BFA] Scanning build for known causes...
>>
>>
>>
>> On Mon, Jun 2, 2014 at 7:00 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> I fixed this - the key was that it would only happen if the MCA param
>>> orte_startup_timeout was set.
>>>
>>> It really does help, folks, if you include info on which MCA params were
>>> set when you hit these failures. Otherwise, it is impossible to replicate
>>> the problem.
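>>>
>>> (One quick way to capture that, assuming the params were set via the
>>> environment or via a personal mca-params.conf file - just a suggestion:
>>>
>>> env | grep OMPI_MCA_
>>> cat ~/.openmpi/mca-params.conf
>>>
>>> together with the exact mpirun command line.)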
>>>
>>>
>>> On Jun 2, 2014, at 6:49 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hi guys
>>>
>>> I'm awake now and will take a look at this - thanks
>>> Ralph
>>>
>>> On Jun 2, 2014, at 6:34 AM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>>
>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>>
>>> Program terminated with signal 11, Segmentation fault.
>>>
>>>
>>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value
>>> optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>>> 607             opal_event_evtimer_del(timer->ev);
>>> Missing separate debuginfos, use: debuginfo-install
>>> glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64
>>> libpciaccess-0.13.1-2.el6.x86_64 numactl-2.0.7-6.el6.x86_64
>>> (gdb) bt
>>> #0  orte_plm_base_post_launch (fd=<value optimized out>, args=<value
>>> optimized out>, cbdata=0x7393b0) at base/plm_base_launch_support.c:607
>>> #1  0x00007ffff7b1076c in event_process_active_single_queue
>>> (base=0x630d30, flags=<value optimized out>) at event.c:1367
>>> #2  event_process_active (base=0x630d30, flags=<value optimized out>) at
>>> event.c:1437
>>> #3  opal_libevent2021_event_base_loop (base=0x630d30, flags=<value
>>> optimized out>) at event.c:1645
>>> #4  0x000000000040501d in orterun (argc=10, argv=0x7fffffffe208) at
>>> orterun.c:1080
>>> #5  0x00000000004039e4 in main (argc=10, argv=0x7fffffffe208) at
>>> main.c:13
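>>>
>>> The faulting address 0x20 looks like a member dereference through a
>>> near-NULL pointer at the opal_event_evtimer_del(timer->ev) call. Purely
>>> as an illustration of the guard pattern - this is not the actual fix,
>>> and toy_timer_t / post_launch_cleanup below are made-up stand-ins for
>>> the real ORTE objects:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> /* Simplified stand-ins for the ORTE startup timer; illustration only. */
>>> typedef struct { int armed; } toy_event_t;
>>> typedef struct { toy_event_t *ev; } toy_timer_t;
>>>
>>> /* The timer only exists when orte_startup_timeout is set, so the
>>>  * cleanup path has to tolerate a missing or half-built timer. */
>>> static void post_launch_cleanup(toy_timer_t *timer)
>>> {
>>>     if (NULL == timer) {
>>>         return;                  /* never created: nothing to delete */
>>>     }
>>>     if (NULL != timer->ev) {
>>>         free(timer->ev);         /* stands in for opal_event_evtimer_del() */
>>>     }
>>>     free(timer);
>>> }
>>>
>>> int main(void)
>>> {
>>>     post_launch_cleanup(NULL);               /* orte_startup_timeout not set */
>>>
>>>     toy_timer_t *t = calloc(1, sizeof(*t));  /* created, event armed */
>>>     t->ev = calloc(1, sizeof(*t->ev));
>>>     post_launch_cleanup(t);                  /* normal teardown */
>>>
>>>     printf("cleanup paths ran without crashing\n");
>>>     return 0;
>>> }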
>>>
>>>
>>> On Mon, Jun 2, 2014 at 3:31 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> OK,
>>>>
>>>> Please send me a clean gdb backtrace:
>>>> ulimit -c unlimited
>>>> /* this should generate a core */
>>>> mpirun ...
>>>> gdb mpirun core...
>>>> bt
>>>>
>>>> If there is no core:
>>>> gdb mpirun
>>>> r -np ... --mca ... ...
>>>> and after the crash:
>>>> bt
>>>>
>>>> Otherwise, I can only review the code and hope to find the root cause of
>>>> an error I am unable to reproduce in my environment.
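>>>>
>>>> (For concreteness, using the command from this thread - adjust the
>>>> install prefix, -np and MCA params to your run; the core file name
>>>> depends on your kernel.core_pattern setting:)
>>>>
>>>> ulimit -c unlimited
>>>> OMPI=/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1
>>>> $OMPI/bin/mpirun -np 8 -mca btl sm,tcp $OMPI/examples/hello_usempi
>>>> gdb $OMPI/bin/mpirun core.<pid>
>>>> (gdb) bt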
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jun 2, 2014 at 9:03 PM, Mike Dubman <mi...@dev.mellanox.co.il>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>> The Jenkins job picked up your commit and applied it automatically; I
>>>>> tried with the MCA flag later.
>>>>> Also, we don't have
>>>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor on our system;
>>>>> the cpuspeed daemon is off by default on all our nodes.
>>>>>
>>>>>
>>>>> Regards
>>>>> M
>>>>>
>>>>>
>>>>> On Mon, Jun 2, 2014 at 3:00 PM, Gilles Gouaillardet <
>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>>> Mike,
>>>>>>
>>>>>> Did you apply the patch *and* run mpirun with --mca rtc_freq_priority 0?
>>>>>>
>>>>>> *Both* are required (--mca rtc_freq_priority 0 is not enough without
>>>>>> the patch).
>>>>>>
>>>>>> Can you please confirm there is no
>>>>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>>>>>> (pseudo) file on your system?
>>>>>>
>>>>>> If this still does not work for you, then this might be a different
>>>>>> issue that I was unable to reproduce.
>>>>>> In that case, could you run mpirun under gdb and send a gdb stack
>>>>>> trace?
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 2, 2014 at 8:26 PM, Mike Dubman <mi...@dev.mellanox.co.il
>>>>>> > wrote:
>>>>>>
>>>>>>> More info: specifying --mca rtc_freq_priority 0 explicitly generates a
>>>>>>> different kind of failure:
>>>>>>>
>>>>>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>>>>>> -np 8 -mca btl sm,tcp --mca rtc_freq_priority 0
>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>>>>>> [vegas12:13887] *** Process received signal ***
>>>>>>> [vegas12:13887] Signal: Segmentation fault (11)
>>>>>>> [vegas12:13887] Signal code: Address not mapped (1)
>>>>>>> [vegas12:13887] Failing at address: 0x20
>>>>>>> [vegas12:13887] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>>>>>> [vegas12:13887] [ 1]
>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_plm_base_post_launch+0x90)[0x7ffff7dcbe50]
>>>>>>> [vegas12:13887] [ 2]
>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-pal.so.0(opal_libevent2021_event_base_loop+0x8bc)[0x7ffff7b1076c]
>>>>>>> [vegas12:13887] [ 3]
>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x126d)[0x40501d]
>>>>>>> [vegas12:13887] [ 4]
>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>>>>>> [vegas12:13887] [ 5]
>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>>>>>> [vegas12:13887] [ 6]
>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>>>>>> [vegas12:13887] *** End of error message ***
>>>>>>> Segmentation fault (core dumped)
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 2, 2014 at 2:24 PM, Mike Dubman <
>>>>>>> mi...@dev.mellanox.co.il> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> This fix ("orte_rtc_base_select: skip a RTC module if it has a zero
>>>>>>>> priority") did not help, and Jenkins still fails as before.
>>>>>>>> OMPI was configured with:
>>>>>>>> --with-platform=contrib/platform/mellanox/optimized
>>>>>>>> --with-ompi-param-check --enable-picky --with-knem --with-mxm
>>>>>>>> --with-fca
>>>>>>>>
>>>>>>>> The run was on a single node:
>>>>>>>>
>>>>>>>> $/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun
>>>>>>>>  -np 8 -mca btl sm,tcp 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/examples/hello_usempi
>>>>>>>> [vegas12:13834] *** Process received signal ***
>>>>>>>> [vegas12:13834] Signal: Segmentation fault (11)
>>>>>>>> [vegas12:13834] Signal code: Address not mapped (1)
>>>>>>>> [vegas12:13834] Failing at address: (nil)
>>>>>>>> [vegas12:13834] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
>>>>>>>> [vegas12:13834] [ 1] /lib64/libc.so.6(fgets+0x2d)[0x3937466f2d]
>>>>>>>> [vegas12:13834] [ 2] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x1f3f)[0x7ffff41f5f3f]
>>>>>>>> [vegas12:13834] [ 3] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_rtc_freq.so(+0x279b)[0x7ffff41f679b]
>>>>>>>> [vegas12:13834] [ 4] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_rtc_base_select+0xe6)[0x7ffff7ddc036]
>>>>>>>> [vegas12:13834] [ 5] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/openmpi/mca_ess_hnp.so(+0x4056)[0x7ffff725b056]
>>>>>>>> [vegas12:13834] [ 6] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/lib/libopen-rte.so.0(orte_init+0x174)[0x7ffff7d97254]
>>>>>>>> [vegas12:13834] [ 7] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(orterun+0x863)[0x404613]
>>>>>>>> [vegas12:13834] [ 8] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun(main+0x20)[0x4039e4]
>>>>>>>> [vegas12:13834] [ 9] 
>>>>>>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
>>>>>>>> [vegas12:13834] [10] 
>>>>>>>> /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/ompi_install1/bin/mpirun[0x403909]
>>>>>>>> [vegas12:13834] *** End of error message ***
>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 2, 2014 at 10:19 AM, Gilles Gouaillardet <
>>>>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Mike and Ralph,
>>>>>>>>>
>>>>>>>>> I could not find a simple workaround.
>>>>>>>>>
>>>>>>>>> For the time being, I committed r31926 and invite those who face a
>>>>>>>>> similar issue to use the following workaround:
>>>>>>>>> export OMPI_MCA_rtc_freq_priority=0
>>>>>>>>> /* or mpirun --mca rtc_freq_priority 0 ... */
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 2, 2014 at 3:45 PM, Gilles Gouaillardet <
>>>>>>>>> gilles.gouaillar...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> In orte/mca/rtc/freq/rtc_freq.c, at line 187:
>>>>>>>>>> fp = fopen(filename, "r");
>>>>>>>>>> where filename is
>>>>>>>>>> "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor".
>>>>>>>>>>
>>>>>>>>>> There is no error check, so if fp is NULL, orte_getline() will
>>>>>>>>>> call fgets() on it and crash.
>>>>>>>>>