Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-11 Thread Ralph Castain
Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
returning an architecture type for some reason, and I didn’t protect against it.
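
For the record, the crash is the "%s" argument to opal_asprintf() in
opal_hwloc_base_get_topo_signature() being invalid when hwloc reports no
architecture. The guard needs to be something like the following sketch
(illustrative only -- the local variable names are assumptions, not the
actual commit):

/* hwloc may not report an architecture string on some platforms
 * (Solaris/SPARC here), so never hand NULL/garbage to the "%s". */
const char *arch = NULL;
hwloc_obj_t root = hwloc_get_root_obj(topo);
if (NULL != root) {
    arch = hwloc_obj_get_info_by_name(root, "Architecture");
}
if (NULL == arch) {
    arch = "unknown";
}
opal_asprintf(&sig, "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s",
              nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);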


> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
> 
> Backtrace for the Solaris-10/SPARC SEGV appears below.
> I've changed the subject line to distinguish this from the earlier report.
> 
> -Paul
> 
> program terminated by signal SEGV (no mapping at the fault address)
> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
> Current function is guess_strlen
>71   len += (int)strlen(sarg);
> (dbx) where
>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
> 0x7d93b634 
> =>[2] guess_strlen(fmt = 0x7eeada98 
> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
> "printf.c"
>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
> "printf.c"
>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
> "printf.c"
>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
> "hwloc_base_util.c"
>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>   [7] orte_init(pargc = 0x761c, pargv = 0x7610, flags 
> = 4U), line 148 in "orte_init.c"
>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
> 
> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
> No, that looks different - it’s failing in mpirun itself. Can you get a line 
> number on it?
> 
> Sorry for delay - I’m generating rc3 now
> 
> 
>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>> 
>> Don't see an rc3 yet.
>> 
>> My Solaris-10/SPARC runs fail slightly differently (see below).
>> It looks sufficiently similar that it MIGHT be the same root cause.
>> However, lacking an rc3 to test I figured it would be better to report this 
>> than to ignore it.
>> 
>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun 
>> compilers.
>> 
>> -Paul
>> 
>> [niagara1:29881] *** Process received signal ***
>> [niagara1:29881] Signal: Segmentation Fault (11)
>> [niagara1:29881] Signal code: Address not mapped (1)
>> [niagara1:29881] Failing at address: 2
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>> /lib/libc.so.1:0xc5364
>> /lib/libc.so.1:0xb9e64
>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
>> [niagara1:29881] *** End of error message ***
>> Segmentation Fault - core dumped
>> 
>> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain  wrote:
>> Ah crud - incomplete commit means we didn’t send the topo string. Will roll 
>> rc3 in a few minutes.
>> 
>> Thanks, Paul
>> Ralph
>> 
>>> On Dec 11, 2014, at 3:08 PM, Paul Hargrove  wrote:
>>> 
>>> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting 
>>> the following crash for both "-m32" and "-m64" builds:
>>> 
>>> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 
>>> examples/ring_c'
>>> [pcp-j-19:18762] *** Process received signal ***
>>> [pcp-j-19:18762] Signal: Segmentation Fault (11)
>>> [pcp-j-19:18762] Signal code: Address not mapped (1)
>>> [pcp-j-19:18762] Failing at address: 0
>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26
>>>  [0xfd7ffaf237ba]
>>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833
>>>  [0xfd7ffaf20ba1]

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-11 Thread Ralph Castain
I honestly think it has to be a selected interface, Gilles, else we will fail 
to connect.

> On Dec 11, 2014, at 8:26 PM, Gilles Gouaillardet 
>  wrote:
> 
> Paul,
> 
> about the five warnings:
> can you confirm you are *not* running mpirun on n15 or n16?
> if my guess is correct, then you can get up to 5 warnings: mpirun + 2 orted 
> + 2 MPI tasks
> 
> do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in your 
> openmpi-mca-params.conf?
> 
> attached is a patch to fix this issue.
> what we really want is to test that there is a loopback interface, period.
> the current code (my bad for not having reviewed it in a timely manner) seems 
> to check that there is a *selected* loopback interface.
> 
> Cheers,
> 
> Gilles
> 
> On 2014/12/12 13:15, Paul Hargrove wrote:
>> Ralph,
>> 
>> Sorry to be the bearer of more bad news.
>> The "good" news is I've seen the new warning regarding the lack of a
>> loopback interface.
>> The BAD news is that it is occurring on a Linux cluster that I've verified
>> DOES have 'lo' configured on the front-end and compute nodes (UP and
>> RUNNING according to ifconfig).
>> 
>> Though run with "-np 2" the warning appears FIVE times.
>> ADDITIONALLY, there is a SEGV at exit!
>> 
>> Unfortunately, despite configuring with --enable-debug, I didn't get line
>> numbers from the core (and there was no backtrace printed).
>> 
>> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>> 
>> Let me know what tracing flags to apply to gather the info needed to debug
>> this.
>> 
>> -Paul
>> 
>> 
>> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
>> --
>> WARNING: No loopback interface was found. This can cause problems
>> when we spawn processes as they are likely to be unable to connect
>> back to their host daemon. Sadly, it may take awhile for the connect
>> attempt to fail, so you may experience a significant hang time.
>> 
>> You may wish to ctrl-c out of your job and activate loopback support
>> on at least one interface before trying again.
>> 
>> --
>> [... above message FOUR more times ...]
>> Process 1 exiting
>> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> --
>> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
>> 11 (Segmentation fault).
>> --
>> 
>> $ /sbin/ifconfig lo
>> loLink encap:Local Loopback
>>   inet addr:127.0.0.1  Mask:255.0.0.0
>>   inet6 addr: ::1/128 Scope:Host
>>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>   RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:0
>>   RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>> 
>> $ ssh n15 /sbin/ifconfig lo
>> loLink encap:Local Loopback
>>   inet addr:127.0.0.1  Mask:255.0.0.0
>>   inet6 addr: ::1/128 Scope:Host
>>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>   RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:0
>>   RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)
>> 
>> $ ssh n16 /sbin/ifconfig lo
>> loLink encap:Local Loopback
>>   inet addr:127.0.0.1  Mask:255.0.0.0
>>   inet6 addr: ::1/128 Scope:Host
>>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>   RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
>>   collisions:0 txqueuelen:0
>>   RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)
>> 
>> $ gdb examples/ring_c core.29728
>> [...]
>> (gdb) where
>> #0  0x002a97a19980 in ?? ()
>> #1  
>> #2  0x003a6d40607c in _Unwind_FindEnclosingFunction () from
>> /lib64/libgcc_s.so.1
>> #3  0x003a6d406b57 in _Unwind_RaiseException () from
>> /lib64/libgcc_s.so.1
>> #4  0x003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
>> #5  0x003a6c30ac50 in __pthread_unwind () from
>> /lib64/tls/libpthread.so.0
>> #6  0x003a6c305202 in sigcancel_handler () from
>> /lib64/tls/libpthread.so.0
>> #7  
>> #8  0x003a6b6bd9a2 in poll () from 

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-11 Thread Gilles Gouaillardet
Paul,

about the five warnings:
can you confirm you are *not* running mpirun on n15 or n16?
if my guess is correct, then you can get up to 5 warnings: mpirun + 2
orted + 2 MPI tasks

do you have any oob_tcp_if_include or oob_tcp_if_exclude settings in
your openmpi-mca-params.conf?

attached is a patch to fix this issue.
what we really want is to test that there is a loopback interface, period.
the current code (my bad for not having reviewed it in a timely manner)
seems to check that there is a *selected* loopback interface.
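
the intent, as a standalone sketch in plain POSIX (just the idea, not the
attached patch):

#include <ifaddrs.h>
#include <net/if.h>
#include <stdbool.h>
#include <stddef.h>

/* true if the host has *any* loopback interface, selected or not */
static bool host_has_loopback(void)
{
    struct ifaddrs *ifap, *ifa;
    bool found = false;

    if (0 != getifaddrs(&ifap)) {
        return false;
    }
    for (ifa = ifap; NULL != ifa; ifa = ifa->ifa_next) {
        if (ifa->ifa_flags & IFF_LOOPBACK) {
            found = true;
            break;
        }
    }
    freeifaddrs(ifap);
    return found;
}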

Cheers,

Gilles

On 2014/12/12 13:15, Paul Hargrove wrote:
> Ralph,
>
> Sorry to be the bearer of more bad news.
> The "good" news is I've seen the new warning regarding the lack of a
> loopback interface.
> The BAD news is that it is occurring on a Linux cluster that I've verified
> DOES have 'lo' configured on the front-end and compute nodes (UP and
> RUNNING according to ifconfig).
>
> Though run with "-np 2" the warning appears FIVE times.
> ADDITIONALLY, there is a SEGV at exit!
>
> Unfortunately, despite configuring with --enable-debug, I didn't get line
> numbers from the core (and there was no backtrace printed).
>
> All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).
>
> Let me know what tracing flags to apply to gather the info needed to debug
> this.
>
> -Paul
>
>
> $ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
> --
> WARNING: No loopback interface was found. This can cause problems
> when we spawn processes as they are likely to be unable to connect
> back to their host daemon. Sadly, it may take awhile for the connect
> attempt to fail, so you may experience a significant hang time.
>
> You may wish to ctrl-c out of your job and activate loopback support
> on at least one interface before trying again.
>
> --
> [... above message FOUR more times ...]
> Process 1 exiting
> Process 0 sending 10 to 1, tag 201 (2 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
> Process 0 decremented value: 8
> Process 0 decremented value: 7
> Process 0 decremented value: 6
> Process 0 decremented value: 5
> Process 0 decremented value: 4
> Process 0 decremented value: 3
> Process 0 decremented value: 2
> Process 0 decremented value: 1
> Process 0 decremented value: 0
> Process 0 exiting
> --
> mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
> 11 (Segmentation fault).
> --
>
> $ /sbin/ifconfig lo
> loLink encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>   RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)
>
> $ ssh n15 /sbin/ifconfig lo
> loLink encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>   RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)
>
> $ ssh n16 /sbin/ifconfig lo
> loLink encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:16436  Metric:1
>   RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)
>
> $ gdb examples/ring_c core.29728
> [...]
> (gdb) where
> #0  0x002a97a19980 in ?? ()
> #1  
> #2  0x003a6d40607c in _Unwind_FindEnclosingFunction () from
> /lib64/libgcc_s.so.1
> #3  0x003a6d406b57 in _Unwind_RaiseException () from
> /lib64/libgcc_s.so.1
> #4  0x003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
> #5  0x003a6c30ac50 in __pthread_unwind () from
> /lib64/tls/libpthread.so.0
> #6  0x003a6c305202 in sigcancel_handler () from
> /lib64/tls/libpthread.so.0
> #7  
> #8  0x003a6b6bd9a2 in poll () from /lib64/tls/libc.so.6
> #9  0x002a978f8f7d in ?? ()
> #10 0x0021000e in ?? ()
> #11 0x in ?? ()
>
>
>

[OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-11 Thread Paul Hargrove
Ralph,

Sorry to be the bearer of more bad news.
The "good" news is I've seen the new warning regarding the lack of a
loopback interface.
The BAD news is that it is occurring on a Linux cluster that I've verified
DOES have 'lo' configured on the front-end and compute nodes (UP and
RUNNING according to ifconfig).

Though run with "-np 2" the warning appears FIVE times.
ADDITIONALLY, there is a SEGV at exit!

Unfortunately, despite configuring with --enable-debug, I didn't get line
numbers from the core (and there was no backtrace printed).

All of this appears below (and no, "-mca mtl psm" is not a typo or a joke).

Let me know what tracing flags to apply to gather the info needed to debug
this.

-Paul


$ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c
--
WARNING: No loopback interface was found. This can cause problems
when we spawn processes as they are likely to be unable to connect
back to their host daemon. Sadly, it may take awhile for the connect
attempt to fail, so you may experience a significant hang time.

You may wish to ctrl-c out of your job and activate loopback support
on at least one interface before trying again.

--
[... above message FOUR more times ...]
Process 1 exiting
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
--
mpirun noticed that process rank 0 with PID 0 on node n15 exited on signal
11 (Segmentation fault).
--

$ /sbin/ifconfig lo
loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:481228 errors:0 dropped:0 overruns:0 frame:0
  TX packets:481228 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:81039065 (77.2 MiB)  TX bytes:81039065 (77.2 MiB)

$ ssh n15 /sbin/ifconfig lo
loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:24885 errors:0 dropped:0 overruns:0 frame:0
  TX packets:24885 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:1509940 (1.4 MiB)  TX bytes:1509940 (1.4 MiB)

$ ssh n16 /sbin/ifconfig lo
loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:24938 errors:0 dropped:0 overruns:0 frame:0
  TX packets:24938 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:1543408 (1.4 MiB)  TX bytes:1543408 (1.4 MiB)

$ gdb examples/ring_c core.29728
[...]
(gdb) where
#0  0x002a97a19980 in ?? ()
#1  
#2  0x003a6d40607c in _Unwind_FindEnclosingFunction () from
/lib64/libgcc_s.so.1
#3  0x003a6d406b57 in _Unwind_RaiseException () from
/lib64/libgcc_s.so.1
#4  0x003a6d406c4c in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
#5  0x003a6c30ac50 in __pthread_unwind () from
/lib64/tls/libpthread.so.0
#6  0x003a6c305202 in sigcancel_handler () from
/lib64/tls/libpthread.so.0
#7  
#8  0x003a6b6bd9a2 in poll () from /lib64/tls/libc.so.6
#9  0x002a978f8f7d in ?? ()
#10 0x0021000e in ?? ()
#11 0x in ?? ()

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-11 Thread Paul Hargrove
Backtrace for the Solaris-10/SPARC SEGV appears below.
I've changed the subject line to distinguish this from the earlier report.

-Paul

program terminated by signal SEGV (no mapping at the fault address)
0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
Current function is guess_strlen
   71   len += (int)strlen(sarg);
(dbx) where
  [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
0x7d93b634
=>[2] guess_strlen(fmt = 0x7eeada98
"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
"printf.c"
  [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
"printf.c"
  [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
"printf.c"
  [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in
"hwloc_base_util.c"
  [6] rte_init(), line 205 in "ess_hnp_module.c"
  [7] orte_init(pargc = 0x761c, pargv = 0x7610,
flags = 4U), line 148 in "orte_init.c"
  [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
  [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"

On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:

> No, that looks different - it's failing in mpirun itself. Can you get a
> line number on it?
>
> Sorry for delay - I'm generating rc3 now
>
>
> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>
> Don't see an rc3 yet.
>
> My Solaris-10/SPARC runs fail slightly differently (see below).
> It looks sufficiently similar that it MIGHT be the same root cause.
> However, lacking an rc3 to test I figured it would be better to report
> this than to ignore it.
>
> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
> Sun compilers.
>
> -Paul
>
> [niagara1:29881] *** Process received signal ***
> [niagara1:29881] Signal: Segmentation Fault (11)
> [niagara1:29881] Signal code: Address not mapped (1)
> [niagara1:29881] Failing at address: 2
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
> /lib/libc.so.1:0xc5364
> /lib/libc.so.1:0xb9e64
> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
> [niagara1:29881] *** End of error message ***
> Segmentation Fault - core dumped
>
> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain  wrote:
>
>> Ah crud - incomplete commit means we didn't send the topo string. Will
>> roll rc3 in a few minutes.
>>
>> Thanks, Paul
>> Ralph
>>
>> On Dec 11, 2014, at 3:08 PM, Paul Hargrove  wrote:
>>
>> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting
>> the following crash for both "-m32" and "-m64" builds:
>>
>> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20
>> examples/ring_c'
>> [pcp-j-19:18762] *** Process received signal ***
>> [pcp-j-19:18762] Signal: Segmentation Fault (11)
>> [pcp-j-19:18762] Signal code: Address not mapped (1)
>> [pcp-j-19:18762] Failing at address: 0
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26
>> [0xfd7ffaf237ba]
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833
>> [0xfd7ffaf20ba1]
>> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfd7fff202cc6]
>> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfd7fff1f648e]
>> /lib/amd64/libc.so.1'strcmp+0x1a [0xfd7fff170fda] [Signal 11 (SEGV)]
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90
>> [0x4010b7]
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c
>> [0x400f2c]
>> [pcp-j-19:18762] *** End of error message ***
>> bash: line 

Re: [OMPI devel] [1.8.4rc2] orted SEGVs on Solaris-11/x86-64

2014-12-11 Thread Ralph Castain
No, that looks different - it’s failing in mpirun itself. Can you get a line 
number on it?

Sorry for delay - I’m generating rc3 now


> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
> 
> Don't see an rc3 yet.
> 
> My Solaris-10/SPARC runs fail slightly differently (see below).
> It looks sufficiently similar that it MIGHT be the same root cause.
> However, lacking an rc3 to test I figured it would be better to report this 
> than to ignore it.
> 
> The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun 
> compilers.
> 
> -Paul
> 
> [niagara1:29881] *** Process received signal ***
> [niagara1:29881] Signal: Segmentation Fault (11)
> [niagara1:29881] Signal code: Address not mapped (1)
> [niagara1:29881] Failing at address: 2
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
> /lib/libc.so.1:0xc5364
> /lib/libc.so.1:0xb9e64
> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
> [niagara1:29881] *** End of error message ***
> Segmentation Fault - core dumped
> 
> On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain  wrote:
> Ah crud - incomplete commit means we didn’t send the topo string. Will roll 
> rc3 in a few minutes.
> 
> Thanks, Paul
> Ralph
> 
>> On Dec 11, 2014, at 3:08 PM, Paul Hargrove  wrote:
>> 
>> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting 
>> the following crash for both "-m32" and "-m64" builds:
>> 
>> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 
>> examples/ring_c'
>> [pcp-j-19:18762] *** Process received signal ***
>> [pcp-j-19:18762] Signal: Segmentation Fault (11)
>> [pcp-j-19:18762] Signal code: Address not mapped (1)
>> [pcp-j-19:18762] Failing at address: 0
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26
>>  [0xfd7ffaf237ba]
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833
>>  [0xfd7ffaf20ba1]
>> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfd7fff202cc6]
>> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfd7fff1f648e]
>> /lib/amd64/libc.so.1'strcmp+0x1a [0xfd7fff170fda] [Signal 11 (SEGV)]
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90
>>  [0x4010b7]
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c
>>  [0x400f2c]
>> [pcp-j-19:18762] *** End of error message ***
>> bash: line 1: 18762 Segmentation Fault  (core dumped) 
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca 
>> ess "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca 
>> orte_ess_num_procs "2" -mca orte_hnp_uri 
>> "911343616.0;tcp://172.16.0.120,172.18.0.120:50362" 
>> --tree-spawn -mca btl "sm,self,openib" -mca plm "rsh" -mca 
>> shmem_mmap_enable_nfs_warning "0"
>> 
>> Running gdb against a core generated by the 32-bit build gives line numbers:
>> #0  0xfea1cb45 in strcmp () from /lib/libc.so.1
>> #1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
>> at 
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
>> #2  0x08050fb1 in main (argc=26, argv=0x80479b0)
>> at 
>> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900

Re: [OMPI devel] [1.8.4rc2] orted SEGVs on Solaris-11/x86-64

2014-12-11 Thread Paul Hargrove
Don't see an rc3 yet.

My Solaris-10/SPARC runs fail slightly differently (see below).
It looks sufficiently similar that it MIGHT be the same root cause.
However, lacking an rc3 to test I figured it would be better to report this
than to ignore it.

The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun
compilers.

-Paul

[niagara1:29881] *** Process received signal ***
[niagara1:29881] Signal: Segmentation Fault (11)
[niagara1:29881] Signal code: Address not mapped (1)
[niagara1:29881] Failing at address: 2
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
/lib/libc.so.1:0xc5364
/lib/libc.so.1:0xb9e64
/lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/openmpi/mca_ess_hnp.so:0x2d90
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-rte.so.7.0.5:orte_init+0x2f8
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:orterun+0xaa8
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:main+0x14
/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/bin/orterun:_start+0x5c
[niagara1:29881] *** End of error message ***
Segmentation Fault - core dumped

On Thu, Dec 11, 2014 at 3:29 PM, Ralph Castain  wrote:

> Ah crud - incomplete commit means we didn't send the topo string. Will
> roll rc3 in a few minutes.
>
> Thanks, Paul
> Ralph
>
> On Dec 11, 2014, at 3:08 PM, Paul Hargrove  wrote:
>
> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting
> the following crash for both "-m32" and "-m64" builds:
>
> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20
> examples/ring_c'
> [pcp-j-19:18762] *** Process received signal ***
> [pcp-j-19:18762] Signal: Segmentation Fault (11)
> [pcp-j-19:18762] Signal code: Address not mapped (1)
> [pcp-j-19:18762] Failing at address: 0
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26
> [0xfd7ffaf237ba]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833
> [0xfd7ffaf20ba1]
> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfd7fff202cc6]
> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfd7fff1f648e]
> /lib/amd64/libc.so.1'strcmp+0x1a [0xfd7fff170fda] [Signal 11 (SEGV)]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90
> [0x4010b7]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c
> [0x400f2c]
> [pcp-j-19:18762] *** End of error message ***
> bash: line 1: 18762 Segmentation Fault  (core dumped)
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca
> ess "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca
> orte_ess_num_procs "2" -mca orte_hnp_uri "911343616.0;tcp://172.16.0.120,
> 172.18.0.120:50362" --tree-spawn -mca btl "sm,self,openib" -mca plm "rsh"
> -mca shmem_mmap_enable_nfs_warning "0"
>
> Running gdb against a core generated by the 32-bit build gives line
> numbers:
> #0  0xfea1cb45 in strcmp () from /lib/libc.so.1
> #1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
> at
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
> #2  0x08050fb1 in main (argc=26, argv=0x80479b0)
> at
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems 

[OMPI devel] [1.8.4rc2] build broken by default on SGI UV

2014-12-11 Thread Paul Hargrove
I think I've reported this earlier in the 1.8 series.
If I compile on an SGI UV (e.g. blacklight at PSC) configure picks up the
presence of xpmem headers and enables the vader BTL.
However, the port of vader to SGI's "flavor" of xpmem is incomplete and the
following build failure results:

make[2]: Entering directory
`/brashear/hargrove/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/BLD/ompi/mca/btl/vader'
  CC   btl_vader_module.lo
In file included from
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader.h:60,
 from
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:29:
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_endpoint.h:76:
error: expected specifier-qualifier-list before 'xpmem_apid_t'
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:
In function 'init_vader_endpoint':
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:197:
error: 'struct ' has no member named 'apid'
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:
In function 'mca_btl_vader_endpoint_destructor':
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:682:
error: 'struct ' has no member named 'apid'
/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/openmpi-1.8.4rc2/ompi/mca/btl/vader/btl_vader_module.c:683:
error: 'struct ' has no member named 'apid'
make[2]: *** [btl_vader_module.lo] Error 1
make[2]: Leaving directory
`/brashear/hargrove/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/BLD/ompi/mca/btl/vader'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/brashear/hargrove/OMPI/openmpi-1.8.4rc2-linux-x86_64-uv/BLD/ompi'
make: *** [all-recursive] Error 1

This can trivially be fixed by making configure not recognize the SGI
variant of xpmem.
I think (untested) that the following is sufficient:

--- ./ompi/mca/btl/vader/configure.m4~  2014-12-11 18:51:11.499654000 -0800
+++ ./ompi/mca/btl/vader/configure.m4   2014-12-11 18:51:52.289654000 -0800
@@ -23,7 +23,7 @@
 AC_ARG_WITH([xpmem],
 [AC_HELP_STRING([--with-xpmem(=DIR)],
 [Build with XPMEM kernel module support, searching for headers in DIR])])
-OMPI_CHECK_WITHDIR([xpmem], [$with_xpmem], [include/xpmem.h include/sn/xpmem.h])
+OMPI_CHECK_WITHDIR([xpmem], [$with_xpmem], [include/xpmem.h])

 AC_ARG_WITH([xpmem-libdir],
 [AC_HELP_STRING([--with-xpmem-libdir=DIR],


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called too late

2014-12-11 Thread Gilles Gouaillardet
George,

please allow me to jump in with naive comments ...

currently (master) both the openib and usnic BTLs invoke opal_using_threads()
in component_init():

btl_openib_component_init(int *num_btl_modules,
                          bool enable_progress_threads,
                          bool enable_mpi_threads)
{
    [...]
    /* Currently refuse to run if MPI_THREAD_MULTIPLE is enabled */
    if (opal_using_threads() && !mca_btl_base_thread_multiple_override) {
        opal_output_verbose(5, opal_btl_base_framework.framework_output,
                            "btl:openib: MPI_THREAD_MULTIPLE not suppported; skipping this component");
        goto no_btls;
    }

> The overall design in OMPI was that no OMPI module should be allowed to 
> decide if threads are on

does "OMPI module" exclude OPAL and ORTE module ?
if yes, did the btl move from OMPI down to OPAL have any impact ?

if not, then could/should opal_using_threads() abort and/or display an
error message if it is called too early
(at least in debug builds) ?
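
something like this hypothetical guard, for instance (the variable names
are invented for illustration; this is not existing OPAL code):

#include <assert.h>
#include <stdbool.h>

extern bool opal_thread_level_decided;  /* set by ompi_mpi_init() */
extern bool opal_uses_threads;

#if OPAL_ENABLE_DEBUG
bool opal_using_threads(void)
{
    /* fail fast if queried before the thread level was decided */
    assert(opal_thread_level_decided);
    return opal_uses_threads;
}
#endif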

Cheers,

Gilles

On 2014/12/12 10:30, Ralph Castain wrote:
> Just to help me understand: I don't think this change actually changed any 
> behavior. However, it certainly *allows* a different behavior. Isn't that 
> true?
>
> If so, I guess the real question is for Pascal at Bull: why do you feel this 
> earlier setting is required?
>
>
>> On Dec 11, 2014, at 4:21 PM, George Bosilca  wrote:
>>
>> The overall design in OMPI was that no OMPI module should be allowed to 
>> decide if threads are on (thus it should not rely on the value returned by 
>> opal_using_threads during its initialization stage). Instead, they should 
>> respect the level of thread support requested as an argument during the 
>> initialization step.
>>
>> And this is true even for the BTLs. The PML component init function is 
>> propagating the enable_progress_threads and enable_mpi_threads, down to the 
>> BML, and then to the BTL. These two variables, enable_progress_threads and 
>> enable_mpi_threads, are exactly what ompi_mpi_init is using to compute 
>> the value of opal_using_threads (and that is what this patch moved).
>>
>> The setting of opal_using_threads was delayed during the initialization 
>> to ensure that its value was not used to select a specific thread-level in 
>> any module, a behavior that is allowed now with the new setting.
>>
>> A drastic change in behavior...
>>
>>   George.
>>
>>
>> On Tue, Dec 9, 2014 at 3:33 AM, Ralph Castain  wrote:
>> Kewl - I'll fix. Thanks!
>>
>>> On Dec 9, 2014, at 12:32 AM, Pascal Deveze  wrote:
>>>
>>> Hi Ralph,
>>>  
>>> This is in the trunk.
>>>  
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Tuesday, December 9, 2014 09:32
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in 
>>> ompi/runtime/ompi_mpi_init.c is called too late
>>>  
>>> Hi Pascal
>>>  
>>> Is this in the trunk or in the 1.8 series (or both)?
>>>  
>>>  
>>> On Dec 9, 2014, at 12:28 AM, Pascal Deveze  wrote:
>>>  
>>>  
>>> In case where MPI is compiled with --enable-mpi-thread-multiple, a call to 
>>> opal_using_threads() always returns 0 in the routine 
>>> btl_xxx_component_init() of the BTLs, even if the application calls 
>>> MPI_Init_thread() with MPI_THREAD_MULTIPLE.
>>>  
>>> This is because opal_set_using_threads(true) in 
>>> ompi/runtime/ompi_mpi_init.c is called too late.
>>>  
>>> I propose the following patch that solves the problem for me:
>>>  
>>> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
>>> index 35509cf..c2370fc 100644
>>> --- a/ompi/runtime/ompi_mpi_init.c
>>> +++ b/ompi/runtime/ompi_mpi_init.c
>>> @@ -512,6 +512,13 @@ int ompi_mpi_init(int argc, char **argv, int 
>>> requested, int *provided)
>>>  }
>>> #endif
>>>  
>>> +/* If thread support was enabled, then setup OPAL to allow for
>>> +   them. */
>>> +if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
>>> +(*provided != MPI_THREAD_SINGLE)) {
>>> +opal_set_using_threads(true);
>>> +}
>>> +
>>>  /* initialize datatypes. This step should be done early as it will
>>>   * create the local convertor and local arch used in the proc
>>>   * init.
>>> @@ -724,13 +731,6 @@ int ompi_mpi_init(int argc, char **argv, int 
>>> requested, int *provided)
>>> goto error;
>>>  }
>>>  
>>> -/* If thread support was enabled, then setup OPAL to allow for
>>> -   them. */
>>> -if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
>>> -(*provided != MPI_THREAD_SINGLE)) {
>>> -opal_set_using_threads(true);
>>> -}
>>> -
>>>  /* start PML/BTL's */
>>>  ret = MCA_PML_CALL(enable(true));
>>>  if( OMPI_SUCCESS != ret ) 

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called too late

2014-12-11 Thread Ralph Castain
Just to help me understand: I don’t think this change actually changed any 
behavior. However, it certainly *allows* a different behavior. Isn’t that true?

If so, I guess the real question is for Pascal at Bull: why do you feel this 
earlier setting is required?


> On Dec 11, 2014, at 4:21 PM, George Bosilca  wrote:
> 
> The overall design in OMPI was that no OMPI module should be allowed to 
> decide if threads are on (thus it should not rely on the value returned by 
> opal_using_threads during its initialization stage). Instead, they should 
> respect the level of thread support requested as an argument during the 
> initialization step.
> 
> And this is true even for the BTLs. The PML component init function is 
> propagating the enable_progress_threads and enable_mpi_threads, down to the 
> BML, and then to the BTL. These two variables, enable_progress_threads and 
> enable_mpi_threads, are exactly what ompi_mpi_init is using to compute 
> the value of opal_using_threads (and that is what this patch moved).
> 
> The setting of opal_using_threads was delayed during the initialization 
> to ensure that its value was not used to select a specific thread-level in 
> any module, a behavior that is allowed now with the new setting.
> 
> A drastic change in behavior...
> 
>   George.
> 
> 
> On Tue, Dec 9, 2014 at 3:33 AM, Ralph Castain  wrote:
> Kewl - I’ll fix. Thanks!
> 
>> On Dec 9, 2014, at 12:32 AM, Pascal Deveze  wrote:
>> 
>> Hi Ralph,
>>  
>> This is in the trunk.
>>  
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Tuesday, December 9, 2014 09:32
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in 
>> ompi/runtime/ompi_mpi_init.c is called too late
>>  
>> Hi Pascal
>>  
>> Is this in the trunk or in the 1.8 series (or both)?
>>  
>>  
>> On Dec 9, 2014, at 12:28 AM, Pascal Deveze  wrote:
>>  
>>  
>> In case where MPI is compiled with --enable-mpi-thread-multiple, a call to 
>> opal_using_threads() always returns 0 in the routine 
>> btl_xxx_component_init() of the BTLs, even if the application calls 
>> MPI_Init_thread() with MPI_THREAD_MULTIPLE.
>>  
>> This is because opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c 
>> is called too late.
>>  
>> I propose the following patch that solves the problem for me:
>>  
>> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
>> index 35509cf..c2370fc 100644
>> --- a/ompi/runtime/ompi_mpi_init.c
>> +++ b/ompi/runtime/ompi_mpi_init.c
>> @@ -512,6 +512,13 @@ int ompi_mpi_init(int argc, char **argv, int requested, 
>> int *provided)
>>  }
>> #endif
>>  
>> +/* If thread support was enabled, then setup OPAL to allow for
>> +   them. */
>> +if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
>> +(*provided != MPI_THREAD_SINGLE)) {
>> +opal_set_using_threads(true);
>> +}
>> +
>>  /* initialize datatypes. This step should be done early as it will
>>   * create the local convertor and local arch used in the proc
>>   * init.
>> @@ -724,13 +731,6 @@ int ompi_mpi_init(int argc, char **argv, int requested, 
>> int *provided)
>> goto error;
>>  }
>>  
>> -/* If thread support was enabled, then setup OPAL to allow for
>> -   them. */
>> -if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
>> -(*provided != MPI_THREAD_SINGLE)) {
>> -opal_set_using_threads(true);
>> -}
>> -
>>  /* start PML/BTL's */
>>  ret = MCA_PML_CALL(enable(true));
>>  if( OMPI_SUCCESS != ret ) {

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called too late

2014-12-11 Thread George Bosilca
The overall design in OMPI was that no OMPI module should be allowed to
decide if threads are on (thus it should not rely on the value
returned by opal_using_threads
during its initialization stage). Instead, they should respect the level
of thread support requested as an argument during the initialization step.

And this is true even for the BTLs. The PML component init function is
propagating the enable_progress_threads and enable_mpi_threads, down to
the BML, and then to the BTL. These two variables, enable_progress_threads and
enable_mpi_threads, are exactly what ompi_mpi_init is using to compute
the value of opal_using_threads (and that is what this patch moved).
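
In other words, a component init should follow this pattern (a hypothetical
sketch, not any actual BTL) and act on the arguments alone:

static mca_btl_base_module_t **
mca_btl_example_component_init(int *num_btl_modules,
                               bool enable_progress_threads,
                               bool enable_mpi_threads)
{
    *num_btl_modules = 0;
    if (enable_mpi_threads) {
        /* MPI_THREAD_MULTIPLE was requested: switch on locking here,
         * or decline to run, based on this argument alone */
        return NULL;
    }
    /* ... allocate and return the module array (elided) ... */
    return NULL;
}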

The setting of opal_using_threads was delayed during the initialization
to ensure that its value was not used to select a specific thread-level in
any module, a behavior that is allowed now with the new setting.

A drastic change in behavior...

  George.


On Tue, Dec 9, 2014 at 3:33 AM, Ralph Castain  wrote:

> Kewl - I’ll fix. Thanks!
>
> On Dec 9, 2014, at 12:32 AM, Pascal Deveze  wrote:
>
> Hi Ralph,
>
> This is in the trunk.
>
> *From:* devel [mailto:devel-boun...@open-mpi.org] *On Behalf Of* Ralph Castain
> *Sent:* Tuesday, December 9, 2014 09:32
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] Patch proposed: opal_set_using_threads(true)
> in ompi/runtime/ompi_mpi_init.c is called too late
>
> Hi Pascal
>
> Is this in the trunk or in the 1.8 series (or both)?
>
>
>
> On Dec 9, 2014, at 12:28 AM, Pascal Deveze  wrote:
>
>
> In case where MPI is compiled with --enable-mpi-thread-multiple, a call to
> opal_using_threads() always returns 0 in the routine
> btl_xxx_component_init() of the BTLs, even if the application calls
> MPI_Init_thread() with MPI_THREAD_MULTIPLE.
>
> This is because opal_set_using_threads(true) in
> ompi/runtime/ompi_mpi_init.c is called too late.
>
> I propose the following patch that solves the problem for me:
>
> diff --git a/ompi/runtime/ompi_mpi_init.c b/ompi/runtime/ompi_mpi_init.c
> index 35509cf..c2370fc 100644
> --- a/ompi/runtime/ompi_mpi_init.c
> +++ b/ompi/runtime/ompi_mpi_init.c
> @@ -512,6 +512,13 @@ int ompi_mpi_init(int argc, char **argv, int
> requested, int *provided)
>  }
> #endif
>
> +/* If thread support was enabled, then setup OPAL to allow for
> +   them. */
> +if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
> +(*provided != MPI_THREAD_SINGLE)) {
> +opal_set_using_threads(true);
> +}
> +
>  /* initialize datatypes. This step should be done early as it will
>   * create the local convertor and local arch used in the proc
>   * init.
> @@ -724,13 +731,6 @@ int ompi_mpi_init(int argc, char **argv, int
> requested, int *provided)
> goto error;
>  }
>
> -/* If thread support was enabled, then setup OPAL to allow for
> -   them. */
> -if ((OPAL_ENABLE_PROGRESS_THREADS == 1) ||
> -(*provided != MPI_THREAD_SINGLE)) {
> -opal_set_using_threads(true);
> -}
> -
>  /* start PML/BTL's */
>  ret = MCA_PML_CALL(enable(true));
>  if( OMPI_SUCCESS != ret ) {


Re: [OMPI devel] [1.8.4rc2] orted SEGVs on Solaris-11/x86-64

2014-12-11 Thread Ralph Castain
Ah crud - incomplete commit means we didn’t send the topo string. Will roll rc3 
in a few minutes.
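
(For the archives: the daemon compares the topology signature sent by mpirun
against its own, roughly this pattern -- a reconstruction, not the literal
orted_main.c code:

    if (0 != strcmp(local_topo_sig, remote_topo_sig)) {
        /* topologies differ: report ours back to the HNP */
    }

and with the string never sent, remote_topo_sig is NULL, which is exactly
the strcmp() fault in Paul's backtrace.)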

Thanks, Paul
Ralph

> On Dec 11, 2014, at 3:08 PM, Paul Hargrove  wrote:
> 
> Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting the 
> following crash for both "-m32" and "-m64" builds:
> 
> $ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 
> examples/ring_c'
> [pcp-j-19:18762] *** Process received signal ***
> [pcp-j-19:18762] Signal: Segmentation Fault (11)
> [pcp-j-19:18762] Signal code: Address not mapped (1)
> [pcp-j-19:18762] Failing at address: 0
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26
>  [0xfd7ffaf237ba]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833
>  [0xfd7ffaf20ba1]
> /lib/amd64/libc.so.1'__sighndlr+0x6 [0xfd7fff202cc6]
> /lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfd7fff1f648e]
> /lib/amd64/libc.so.1'strcmp+0x1a [0xfd7fff170fda] [Signal 11 (SEGV)]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90
>  [0x4010b7]
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c
>  [0x400f2c]
> [pcp-j-19:18762] *** End of error message ***
> bash: line 1: 18762 Segmentation Fault  (core dumped) 
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca ess 
> "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs "2" -mca orte_hnp_uri 
> "911343616.0;tcp://172.16.0.120,172.18.0.120:50362" 
> --tree-spawn -mca btl "sm,self,openib" -mca plm "rsh" -mca 
> shmem_mmap_enable_nfs_warning "0"
> 
> Running gdb against a core generated by the 32-bit build gives line numbers:
> #0  0xfea1cb45 in strcmp () from /lib/libc.so.1
> #1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
> at 
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
> #2  0x08050fb1 in main (argc=26, argv=0x80479b0)
> at 
> /shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



[OMPI devel] [1.8.4rc2] orted SEGVs on Solaris-11/x86-64

2014-12-11 Thread Paul Hargrove
Testing the 1.8.4rc2 tarball on my x86-64 Solaris-11 systems I am getting
the following crash for both "-m32" and "-m64" builds:

$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20
examples/ring_c'
[pcp-j-19:18762] *** Process received signal ***
[pcp-j-19:18762] Signal: Segmentation Fault (11)
[pcp-j-19:18762] Signal code: Address not mapped (1)
[pcp-j-19:18762] Failing at address: 0
/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'opal_backtrace_print+0x26
[0xfd7ffaf237ba]
/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/lib/libopen-pal.so.6.2.1'show_stackframe+0x833
[0xfd7ffaf20ba1]
/lib/amd64/libc.so.1'__sighndlr+0x6 [0xfd7fff202cc6]
/lib/amd64/libc.so.1'call_user_handler+0x2aa [0xfd7fff1f648e]
/lib/amd64/libc.so.1'strcmp+0x1a [0xfd7fff170fda] [Signal 11 (SEGV)]
/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'main+0x90
[0x4010b7]
/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted'_start+0x6c
[0x400f2c]
[pcp-j-19:18762] *** End of error message ***
bash: line 1: 18762 Segmentation Fault  (core dumped)
/shared/OMPI/openmpi-1.8.4rc2-solaris11-x64-ib-gcc452/INST/bin/orted -mca
ess "env" -mca orte_ess_jobid "911343616" -mca orte_ess_vpid 1 -mca
orte_ess_num_procs "2" -mca orte_hnp_uri "911343616.0;tcp://172.16.0.120,
172.18.0.120:50362" --tree-spawn -mca btl "sm,self,openib" -mca plm "rsh"
-mca shmem_mmap_enable_nfs_warning "0"

Running gdb against a core generated by the 32-bit build gives line numbers:
#0  0xfea1cb45 in strcmp () from /lib/libc.so.1
#1  0xfeef4900 in orte_daemon (argc=26, argv=0x80479b0)
at
/shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/orted/orted_main.c:789
#2  0x08050fb1 in main (argc=26, argv=0x80479b0)
at
/shared/OMPI/openmpi-1.8.4rc2-solaris11-x86-ib-gcc452/openmpi-1.8.4rc2/orte/tools/orted/orted.c:62

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Larry Baker
On 11 Dec 2014, at 2:12 PM, Paul Hargrove wrote:

> I believe Larry Baker of USGS is also a PGI user (in production, rather than 
> just testing as I do). 


That is correct.

Although we are running a rather old Rocks cluster kit (CentOS based) which is 
so old that we cannot run the latest PGI releases.  Some time after the first 
of the year I plan to update Rocks and PGI and Intel and Oracle and GNU.  I'm 
giving up on PathScale and AMD/Open64.  I have already updated all the cluster 
firmware.  I just get sidetracked.

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov



Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Paul Hargrove
Howard,

I regularly test release candidates against the PGI installations on
NERSC's systems (and sometimes elsewhere).  In fact, I have a test of
1.8.4rc2 against pgi-14.4 "in the pipe" right now.

I believe Larry Baker of USGS is also a PGI user (in production, rather
than just testing as I do).

-Paul

On Thu, Dec 11, 2014 at 1:34 PM, Jeff Squyres (jsquyres)  wrote:

> Howard --
>
> One thing I neglected to say -- if libfabric/usnic support on master is
> causing problems for you, you can configure without libfabric:
>
> ./configure --without-libfabric ...
>
> (which will, of course, also disable anything that requires libfabric)
>
> The intent is that we build things by default so that we can get at least
> smoke testing of as many features as possible -- especially testing that
> they don't interfere with others.  But we tend to put in options to shut
> off such things if they *do* cause problems.  Right now, libfabric is
> causing a few problems for you, so you should feel free to disable it until
> we figure out the integration problems (and if you could send me the
> details, I can have a look at what's going wrong).
>
> I'm sorry; I should have mentioned this earlier, but I assumed you knew
> about it / keep forgetting that you're still kinda new to our community and
> don't know all the conventions that we typically put in place!
>
> My bad.  :-(
>
>
>
> On Dec 11, 2014, at 10:45 AM, Jeff Squyres (jsquyres) 
> wrote:
>
> > On Dec 11, 2014, at 9:58 AM, Howard Pritchard  wrote:
> >
> >> Okay, I'll try to fix things.  Problem in opal_datatype_internal.h,
> >> then a meltdown with libfabric owing to the fact that it's probably
> >> only been used in a gnu env.  I'll open an issue on that one and assign
> >> it to Jeff.
> >
> > Ok.
> >
> > FWIW: I test with gcc and the intel compiler suite.  I do not have a PGI
> license to test with.
> >
> >> I think we should be turning this libfabric build off unless one asks
> for it.
> >
> > Obviously, I disagree.  :-)
> >
> > I'm sorry for the annoyances, but we have long since found out that
> features that are not enabled by default do not get tested in the wild and
> therefore do not get debugged.
> >
> > If you send me the details of the PGI problem, I'll be happy to look in
> to it.
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Jeff Squyres (jsquyres)
Howard --

One thing I neglected to say -- if libfabric/usnic support on master is causing 
problems for you, you can configure without libfabric:

./configure --without-libfabric ...

(which will, of course, also disable anything that requires libfabric)

The intent is that we build things by default so that we can get at least smoke 
testing of as many features as possible -- especially testing that they don't 
interfere with others.  But we tend to put in options to shut off such things 
if they *do* cause problems.  Right now, libfabric is causing a few problems 
for you, so you should feel free to disable it until we figure out the 
integration problems (and if you could send me the details, I can have a look 
at what's going wrong).

I'm sorry; I should have mentioned this earlier, but I assumed you knew about 
it / keep forgetting that you're still kinda new to our community and don't 
know all the conventions that we typically put in place!  

My bad.  :-(



On Dec 11, 2014, at 10:45 AM, Jeff Squyres (jsquyres)  
wrote:

> On Dec 11, 2014, at 9:58 AM, Howard Pritchard  wrote:
> 
> >> Okay, I'll try to fix things.  Problem in opal_datatype_internal.h, then a 
>> meltdown with libfabric owing to the fact that it's probably
>> only been used in a gnu env.  I'll open an issue on that one and assign it 
>> to Jeff.   
> 
> Ok.
> 
> FWIW: I test with gcc and the intel compiler suite.  I do not have a PGI 
> license to test with.
> 
>> I think we should be turning this libfabric build off unless one asks for it.
> 
> Obviously, I disagree.  :-)
> 
> I'm sorry for the annoyances, but we have long since found out that features 
> that are not enabled by default do not get tested in the wild and therefore 
> do not get debugged.
> 
> If you send me the details of the PGI problem, I'll be happy to look in to it.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Introducing memkind + Adding component in mpool framework

2014-12-11 Thread Jeff Squyres (jsquyres)
Ok.  Howard asked me about this in person this week at the MPI Forum.  I think 
we all agree that this sounds like an interesting prospect; we just need to 
make some adjustments in the OMPI infrastructure to make it happen.  That will 
take some discussion.
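
To make the discussion concrete, here is a rough sketch of what user code
could look like if such an info key were eventually wired up; the key name
"mpool" and the value "memkind" are purely illustrative, since nothing
defines them in OMPI today:

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Info info;
      void *buf;

      MPI_Init(&argc, &argv);

      /* Hypothetical info key steering MPI_Alloc_mem toward a
       * memkind-backed mpool; neither key nor value exists yet. */
      MPI_Info_create(&info);
      MPI_Info_set(info, "mpool", "memkind");

      MPI_Alloc_mem(1 << 20, info, &buf);  /* 1 MiB from the chosen pool */
      /* ... use buf for communication ... */
      MPI_Free_mem(buf);

      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }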


On Dec 11, 2014, at 11:58 AM, Vishwanath Venkatesan  wrote:

> Hi Jeff & Ralph,
> 
> Thanks for the response, and sorry for the delay in my reply. Attending the 
> developers meeting sounds like a good idea, but I will not be back from my 
> vacation until the 15th, so I cannot confirm whether I can attend before 
> then. I will keep you posted on this.
> 
> @Ralph: The wedding went really well! Thanks for asking :)
> 
> 
> Best,
> Vish
> 
> On Tue, Dec 2, 2014 at 10:27 PM, Jeff Squyres (jsquyres)  
> wrote:
> Vish --
> 
> In general, this sounds like a great idea.
> 
> We talked about this on the call today, and it looks like it's going to take 
> a bit of thought into how to integrate this into OMPI.  I.e., we might have 
> to adjust the mpool and/or allocator frameworks a bit first.
> 
> Is there any chance that you can attend the OMPI face-to-face dev meeting in 
> late January?
> 
> https://github.com/open-mpi/ompi/wiki/Meeting-2015-01
> 
> 
> On Nov 18, 2014, at 7:38 PM, Vishwanath Venkatesan  
> wrote:
> 
> > Hello all,
> >
> > I have been working on an implementation for supporting the use of 
> > MPI_Alloc_mem with our new allocator library called memkind 
> > (https://github.com/memkind/). The memkind library allows allocation from 
> > different kinds of memory, where the kinds implemented within the library 
> > enable control of NUMA and page-size features.  This could be leveraged 
> > conveniently with MPI_Alloc_mem.
> >
> > I was hoping to trigger the use of the memkind component by using either an 
> > info object or an mca parameter (mpirun -np x --mca mpool memkind).
> > The modules of the mpool framework are loaded from components in the btl 
> > framework and not in the base of mpool. But in the case of my 
> > implementation, the component can remain independent from the btl 
> > framework. Is there a way to introduce priority for mpool component 
> > selection?
> >
> > Also, with the use of info objects in mpool_base_alloc.c, it looks like the 
> > same code path is taken irrespective of whether the info is null or not, as 
> > the branch conditions seem to be commented out. Could this be un-commented 
> > or will there be a different patch for this?
> >
> > Please let me know,
> > Thanks,
> > Vish
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > http://www.open-mpi.org/community/lists/devel/2014/11/16320.php
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16408.php
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16509.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] Introducing memkind + Adding component in mpool framework

2014-12-11 Thread Vishwanath Venkatesan
Hi Jeff & Ralph,

Thanks for the response, and sorry for the delay in my reply. Attending the
developers meeting sounds like a good idea, but I will not be back from my
vacation until the 15th, so I cannot confirm whether I can attend before
then. I will keep you posted on this.

@Ralph: The wedding went really well! Thanks for asking :)


Best,
Vish

On Tue, Dec 2, 2014 at 10:27 PM, Jeff Squyres (jsquyres)  wrote:

> Vish --
>
> In general, this sounds like a great idea.
>
> We talked about this on the call today, and it looks like it's going to
> take a bit of thought into how to integrate this into OMPI.  I.e., we might
> have to adjust the mpool and/or allocator frameworks a bit first.
>
> Is there any chance that you can attend the OMPI face-to-face dev meeting
> in late January?
>
> https://github.com/open-mpi/ompi/wiki/Meeting-2015-01
>
>
> On Nov 18, 2014, at 7:38 PM, Vishwanath Venkatesan 
> wrote:
>
> > Hello all,
> >
> > I have been working on an implementation for supporting the use of
> MPI_Alloc_mem with our new allocator library called memkind (
> https://github.com/memkind/). The memkind library allows allocation from
> different kinds of memory, where the kinds implemented within the library
> enable control of NUMA and page-size features.  This could be leveraged
> conveniently with MPI_Alloc_mem.
> >
> > I was hoping to trigger the use of the memkind component by using either
> an info object or an mca parameter (mpirun -np x --mca mpool memkind).
> > The modules of the mpool framework are loaded from components in the btl
> framework and not in the base of mpool. But in the case of my
> implementation, the component can remain independent from the btl
> framework. Is there a way to introduce priority for mpool component
> selection?
> >
> > Also, with the use of info objects in mpool_base_alloc.c, it looks like
> the same code path is taken irrespective of whether the info is null or
> not, as the branch conditions seem to be commented out. Could this be
> un-commented or will there be a different patch for this?
> >
> > Please let me know,
> > Thanks,
> > Vish
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16320.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16408.php
>


Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Paul Kapinos

Jeff,
PGI compiler(s) are available on our cluster:

$ module avail pgi

there are a lot of older versions, too:
$ module load DEPRECATED
$ module avail pgi


best

Paul


P.S. In our standard environment, the Intel compiler and Open MPI are active, so

$ module unload openmpi intel
$ module load pgi

P.S. We also have Sun/Oracle Studio:
$ module avail studio



On 12/11/14 19:45, Jeff Squyres (jsquyres) wrote:

Ok.

FWIW: I test with gcc and the intel compiler suite.  I do not have a PGI 
license to test with.



--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, IT Center
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915





Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Jeff Squyres (jsquyres)
On Dec 11, 2014, at 9:58 AM, Howard Pritchard  wrote:

> Okay, I'll try to fix things.  Problem in opal_datatype_internal.h, then a 
> meltdown with libfabric owing to the fact that it's probably
> only been used in a gnu env.  I'll open an issue on that one and assign it to 
> Jeff.   

Ok.

FWIW: I test with gcc and the intel compiler suite.  I do not have a PGI 
license to test with.

> I think we should be turning this libfabric build off unless one asks for it.

Obviously, I disagree.  :-)

I'm sorry for the annoyances, but we have long since found out that features 
that are not enabled by default do not get tested in the wild and therefore do 
not get debugged.

If you send me the details of the PGI problem, I'll be happy to look in to it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Howard Pritchard
Okay, I'll try to fix things.  Problem in opal_datatype_internal.h, then a
meltdown with libfabric owing to the fact that it's probably
only been used in a gnu env.  I'll open an issue on that one and assign it
to Jeff.

I think we should be turning this libfabric build off unless one asks for
it.

Howard


2014-12-11 7:42 GMT-08:00 Jeff Squyres (jsquyres) :
>
> On Dec 11, 2014, at 7:40 AM, Ralph Castain  wrote:
>
> > I’m unaware of any conscious decision to cut pgi off - I think it has
> been more a case of nobody having a license to use for testing.
>
> +1
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16504.php
>


Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Nathan Hjelm
On Thu, Dec 11, 2014 at 07:37:17AM -0800, Howard Pritchard wrote:
>Hi Folks,
>I'm trying to use mtt on a cluster where it looks like the only functional
>compiler that
>1) can build open mpi master
>2) can also build the ibm test suite
>may be pgi.  Can't compile right now, so I'm trying to fix it.  But I'm
>now wondering
>whether we are still supporting building open mpi with pgi compilers?  I'm
>using pgi
>14.4.

It *should* work, but it really depends on whether pgi fixed any of the
libnuma problems we complained about years ago. They have (had?) an
internal, broken version of libnuma that they use for their OpenMP
support. We had to build with -Mnoopenmp to get it to work at all. I got
so fed up with pgi that I dropped all support for using it with Open MPI
on Cielo. That said, now that pgi is part of nvidia, things may be
better.
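
If anyone wants to try it, here is a sketch of the workaround build
(compiler names are from the PGI 14.x era and may vary by release):

  ./configure CC=pgcc CXX=pgCC FC=pgf90 \
      CFLAGS=-Mnoopenmp FCFLAGS=-Mnoopenmp \
      --prefix=$HOME/ompi-pgi
  make -j8 install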

I would check with HPC-3 and see how they build with pgi on our other
systems.

-Nathan




Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Jeff Squyres (jsquyres)
On Dec 11, 2014, at 7:40 AM, Ralph Castain  wrote:

> I’m unaware of any conscious decision to cut pgi off - I think it has been 
> more a case of nobody having a license to use for testing.

+1

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] still supporting pgi?

2014-12-11 Thread Ralph Castain
I’m unaware of any conscious decision to cut pgi off - I think it has been more 
a case of nobody having a license to use for testing.

> On Dec 11, 2014, at 7:37 AM, Howard Pritchard  wrote:
> 
> Hi Folks,
> 
> I'm trying to use mtt on a cluster where it looks like the only functional 
> compiler that
> 
> 1) can build open mpi master
> 2) can also build the ibm test suite
> 
> may be pgi.  Can't compile right now, so I'm trying to fix it.  But I'm now 
> wondering
> whether we are still supporting building open mpi with pgi compilers?  I'm 
> using pgi
> 14.4.
> 
> Thanks,
> 
> Howard
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16502.php



[OMPI devel] still supporting pgi?

2014-12-11 Thread Howard Pritchard
Hi Folks,

I'm trying to use mtt on a cluster where it looks like the only functional
compiler that

1) can build open mpi master
2) can also build the ibm test suite

may be pgi.  Can't compile right now, so I'm trying to fix it.  But I'm now
wondering
whether we are still supporting building open mpi with pgi compilers?  I'm
using pgi
14.4.

Thanks,

Howard


[OMPI devel] 1.8.4rc2 now available for testing

2014-12-11 Thread Ralph Castain
In the usual place - this is an early rc, as it doesn’t yet contain the
thread-multiple fix that is impacting performance. However, I wanted to give
people a chance to run all their non-threaded functional validation tests.

The release candidate includes a wide range of bug fixes as reported by users 
over the last month, many of which have been subsequently tested by the 
reporters. Still, I would appreciate fairly thorough testing, as this is 
expected to be the last 1.8-series release for some time.

http://www.open-mpi.org/software/ompi/v1.8/ 
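
For those who haven't run one of these before, a minimal smoke test looks
roughly like this (the exact tarball name under that page may differ):

  tar xjf openmpi-1.8.4rc2.tar.bz2
  cd openmpi-1.8.4rc2
  ./configure --prefix=$HOME/ompi-1.8.4rc2
  make -j8 install
  export PATH=$HOME/ompi-1.8.4rc2/bin:$PATH
  mpicc examples/ring_c.c -o ring_c
  mpirun -np 2 ./ring_c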


We are still shooting to get this officially released before the holidays - 
which means we have 2 weeks to complete the release cycle.

Thanks
Ralph