Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
I suspect we’ll just remove it, but I want to give the other developers a 
chance to chime in before doing so.

> On Dec 12, 2014, at 6:07 PM, Paul Hargrove  wrote:
> 
> Ralph,
> 
> If preserved at all, the existing code should probably be made to act more 
> intelligently when it encounters an unknown escape code.  I would suggest 
> advancing the length by some value (say 128?) that should be "big enough" and 
> printing a prominent warning.  So, the next time this bug surfaces it will be 
> (a) non-fatal and (b) easy to pin down.
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 5:46 PM, Ralph Castain wrote:
> Looking at the comments in the code, it appears that the rationale when 
> written was to provide support for REALLY ancient systems that didn’t have 
> some of these functions. Since that time, we added a configure check for 
> vsnprintf, so I’m adding Paul/Larry’s suggested code, protected by that 
> configure.
> 
> Since I suspect the configure check will always pass on any system of 
> interest today, I think this will solve the problem. We can then address the 
> broader question (e.g., do we even need this stuff any more at all?) in a 
> more leisurely way.
> 
> 
>> On Dec 12, 2014, at 5:42 PM, Larry Baker wrote:
>> 
>> On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:
>> 
>>> HOWEVER, while the patch catches the "%u" case, there are plenty of 
>>> potential ways to hit the same problem if, for instance, one uses "%zu" for 
>>> size_t.  Additionally, I've already noted that the code for "%ld", "%lx", 
>>> "%lX", "%lf" are all currently incorrect.
>> 
>> 
>> Not sure if it is applicable, but C99 has an <inttypes.h> header which 
>> #include's <stdint.h> and provides additional capabilities, such as 
>> printf()/scanf() format macros for the types defined in <stdint.h>.
>> 
>> Larry Baker
>> US Geological Survey
>> 650-329-5608 
>> ba...@usgs.gov 
>> 
> 
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov 
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16578.php



Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
Ralph,

If preserved at all, the existing code should probably be made to act more
intelligently when it encounters an unknown escape code.  I would suggest
advancing the length by some value (say 128?) that should be "big enough"
and printing a prominent warning.  So, the next time this bug surfaces it
will be (a) non-fatal and (b) easy to pin down.
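
A minimal sketch of the fallback described above, assuming a hand-rolled scanner shaped roughly like guess_strlen(). The names estimate_len and UNKNOWN_CONVERSION_PAD are hypothetical, only %d, %u and %s are handled for brevity, and this is not the Open MPI code. Once an unknown conversion is seen the sketch stops calling va_arg() altogether, since it can no longer know which argument comes next:

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical pad added for every conversion the scanner cannot size. */
#define UNKNOWN_CONVERSION_PAD 128

/* Rough upper-bound estimate of the formatted length, excluding the '\0'.
 * The numeric widths assume int is at most 32 bits wide. */
static int estimate_len(const char *fmt, ...)
{
    va_list ap;
    int len = 0;
    int in_sync = 1;   /* cleared once an unknown conversion is seen */

    va_start(ap, fmt);
    for (const char *p = fmt; *p != '\0'; ++p) {
        if (*p != '%') {
            ++len;                      /* ordinary character */
            continue;
        }
        ++p;
        if (*p == '\0') {
            break;                      /* stray '%' at end of string */
        }
        if (*p == '%') {
            ++len;                      /* "%%" prints a single '%' */
        } else if (in_sync && *p == 'd') {
            (void) va_arg(ap, int);
            len += 11;                  /* sign plus 10 digits */
        } else if (in_sync && *p == 'u') {
            (void) va_arg(ap, unsigned int);
            len += 10;                  /* 10 digits */
        } else if (in_sync && *p == 's') {
            const char *s = va_arg(ap, const char *);
            len += (int) strlen(s != NULL ? s : "(null)");
        } else {
            /* Unknown (or no longer trustworthy) conversion: warn loudly,
             * pad the estimate, and stop consuming arguments. The pad is
             * only a guess; a long string argument could still exceed it. */
            fprintf(stderr, "WARNING: unhandled conversion '%%%c' in \"%s\"\n",
                    *p, fmt);
            in_sync = 0;
            len += UNKNOWN_CONVERSION_PAD;
        }
    }
    va_end(ap);
    return len;
}
```

With this shape, a format such as "%zu:%s" produces two warnings and a padded estimate instead of handing a bogus pointer to strlen().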

-Paul

On Fri, Dec 12, 2014 at 5:46 PM, Ralph Castain  wrote:

> Looking at the comments in the code, it appears that the rationale when
> written was to provide support for REALLY ancient systems that didn't have
> some of these functions. Since that time, we added a configure check for
> vsnprintf, so I'm adding Paul/Larry's suggested code, protected by that
> configure.
>
> Since I suspect the configure check will always pass on any system of
> interest today, I think this will solve the problem. We can then address
> the broader question (e.g., do we even need this stuff any more at all?) in
> a more leisurely way.
>
>
> On Dec 12, 2014, at 5:42 PM, Larry Baker  wrote:
>
> On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:
>
> HOWEVER, while the patch catches the "%u" case, there are plenty of
> potential ways to hit the same problem if, for instance, one uses "%zu" for
> size_t.  Additionally, I've already noted that the code for "%ld", "%lx",
> "%lX", "%lf" are all currently incorrect.
>
>
> Not sure if it is applicable, but C99 has an <inttypes.h> header which
> #include's <stdint.h> and provides additional capabilities, such as
> printf()/scanf() format macros for the types defined in <stdint.h>.
>
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
>
>
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
Looking at the comments in the code, it appears that the rationale when written 
was to provide support for REALLY ancient systems that didn’t have some of 
these functions. Since that time, we added a configure check for vsnprintf, so 
I’m adding Paul/Larry’s suggested code, protected by that configure.

Since I suspect the configure check will always pass on any system of interest 
today, I think this will solve the problem. We can then address the broader 
question (e.g., do we even need this stuff any more at all?) in a more 
leisurely way.
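
For illustration only, a sketch of what "protected by that configure" could look like. The macro name HAVE_VSNPRINTF is an assumption (it follows the usual Autoconf naming), and guess_strlen_sketch and needed_bytes are hypothetical names, not the committed code:

```c
#include <stdarg.h>
#include <stdio.h>

/* Normally supplied by the generated configure header; defined here only
 * so the sketch stands alone. The macro name is an assumption. */
#ifndef HAVE_VSNPRINTF
#define HAVE_VSNPRINTF 1
#endif

/* Returns the number of bytes needed for the formatted string, including
 * the trailing '\0', or -1 if it cannot be determined. */
static int guess_strlen_sketch(const char *fmt, va_list ap)
{
#if HAVE_VSNPRINTF
    char dummy[1];
    va_list copy;
    int n;

    va_copy(copy, ap);          /* leave the caller's va_list untouched */
    n = vsnprintf(dummy, sizeof(dummy), fmt, copy);
    va_end(copy);
    return (n < 0) ? -1 : n + 1;
#else
    /* The legacy hand-rolled format scanner would remain here. */
    (void) fmt;
    (void) ap;
    return -1;
#endif
}

/* Small variadic wrapper so the sketch can be exercised directly. */
static int needed_bytes(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = guess_strlen_sketch(fmt, ap);
    va_end(ap);
    return n;
}
```

The sketch relies on the C99 behavior in which vsnprintf() returns the length that would have been written; older implementations that return -1 on truncation would still need the fallback path.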


> On Dec 12, 2014, at 5:42 PM, Larry Baker  wrote:
> 
> On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:
> 
>> HOWEVER, while the patch catches the "%u" case, there are plenty of 
>> potential ways to hit the same problem if, for instance, one uses "%zu" for 
>> size_t.  Additionally, I've already noted that the code for "%ld", "%lx", 
>> "%lX", "%lf" are all currently incorrect.
> 
> 
> Not sure if it is applicable, but C99 has an <inttypes.h> header which 
> #include's <stdint.h> and provides additional capabilities, such as 
> printf()/scanf() format macros for the types defined in <stdint.h>.
> 
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov 
> 



Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Larry Baker
On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:

> HOWEVER, while the patch catches the "%u" case, there are plenty of potential 
> ways to hit the same problem if, for instance, one uses "%zu" for size_t.  
> Additionally, I've already noted that the code for "%ld", "%lx", "%lX", "%lf" 
> are all currently incorrect.


Not sure if it is applicable, but C99 has an <inttypes.h> header which 
#include's <stdint.h> and provides additional capabilities, such as 
printf()/scanf() format macros for the types defined in <stdint.h>.
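
As a small illustration of those format macros (a sketch; the function name format_counts and its parameters are hypothetical), the PRI* macros from <inttypes.h> supply the right conversion for a fixed-width type on whatever platform is being compiled for:

```c
#include <inttypes.h>
#include <stdio.h>

/* Formats a 32-bit count and a 64-bit size without guessing whether the
 * platform wants "%u", "%lu" or "%llu". Returns what snprintf() returns. */
static int format_counts(char *buf, size_t size, uint32_t nnuma, uint64_t bytes)
{
    return snprintf(buf, size, "%" PRIu32 "N:%" PRIu64 "B", nnuma, bytes);
}
```

Note that these macros expand to ordinary length modifiers and conversion letters, so a hand-written format scanner would still have to understand whatever they expand to.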

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov



Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Larry Baker
Or, slightly modified using a defensive coding style:

>   return 1 + vsnprintf(dummy, sizeof( dummy ), fmt, ap);

if you like sizeof() [which I prefer].  Or, if you like sizeof:

>   return 1 + vsnprintf(dummy, sizeof dummy, fmt, ap);
> 
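
To show where such a one-liner would sit, here is a sketch of a measure-then-allocate helper. The names vasprintf_sketch and asprintf_sketch are hypothetical and this is not the Open MPI implementation. One detail worth noting: vsnprintf() consumes the va_list it is given, so the measuring pass works on a va_copy() and the original list is kept for the real formatting pass:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Measures the output with a 1-byte dummy buffer, allocates exactly what
 * is needed, then formats for real. Returns the string length, or -1. */
static int vasprintf_sketch(char **out, const char *fmt, va_list ap)
{
    char dummy[1];
    va_list probe;
    int len;

    *out = NULL;

    va_copy(probe, ap);                     /* measuring pass uses a copy */
    len = vsnprintf(dummy, sizeof dummy, fmt, probe);
    va_end(probe);
    if (len < 0) {
        return -1;
    }

    *out = malloc((size_t) len + 1);        /* +1 for the trailing '\0' */
    if (*out == NULL) {
        return -1;
    }
    return vsnprintf(*out, (size_t) len + 1, fmt, ap);
}

/* Variadic front end, in the style of asprintf(). */
static int asprintf_sketch(char **out, const char *fmt, ...)
{
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = vasprintf_sketch(out, fmt, ap);
    va_end(ap);
    return len;
}
```

The caller owns the returned buffer and releases it with free().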


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov



On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote:

> OK, applying my attached patch (based on Gilles's observation) resolved the 
> problem!
> So I fully expect Ralph's plan to use "%d" to also resolve this.
> 
> HOWEVER, while the patch catches the "%u" case, there are plenty of potential 
> ways to hit the same problem if, for instance, one uses "%zu" for size_t.  
> Additionally, I've already noted that the code for "%ld", "%lx", "%lX", "%lf" 
> are all currently incorrect.
> 
> So, I ask: "Why isn't guess_strlen() just implemented as follows?"
> 
> /* From man vsnprintf:
>  *The functions snprintf and vsnprintf do not write more  than
>  * size  bytes (including the trailing '\0').  If the output was truncated
>  * due to this limit then the return value is  the  number  of  characters
>  * (not  including the trailing '\0') which would have been written to the
>  * final string if enough space had been available. 
>  */
> static int guess_strlen(const char *fmt, va_list ap)
> { 
>   char dummy[1];
>   return 1 + vsnprintf(dummy, 1, fmt, ap);
> }
> 
> 
> BTW: I do see some messages like "select: Interrupted system call" which I 
> assume are related to the timeout code (and thus the subject of a different 
> thread).
> 
> 
> -Paul 
> 
> On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove  wrote:
> Thanks, Gilles!
> 
> I was looking at that same code just now and completely missed the lack of a 
> case for '%u' (and '%lu').  I will add one now and see if that resolves the 
> problem
> 
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet 
>  wrote:
> Ralph,
> 
> I cannot find a case for the %u format in guess_strlen.
> And since the default does not invoke va_arg(),
> it seems strlen is invoked on nnuma instead of arch
> 
> Makes sense ?
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain  wrote:
> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
> down there. This is at the beginning of orte_init, so there are no threads 
> running nor has anything much happened.
> 
> Do you have any suggestions?
> 
> 
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>> 
>> Ralph,
>> 
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134   nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>> 
>> And so is "fmt":
>> 
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>> 
>> However, things have gone bad in guess_strlen():
>> 
>> Current function is guess_strlen
>>    71   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>> 
>> -Paul
>> 
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>> Hmmm….this is really odd. I actually do have a protection for that arch 
>> value being NULL, and you are in the code section when it isn’t.
>> 
>> Do you still have the core file around? If so, can you print out the value 
>> of the “arch” variable? It would be in the 
>> opal_hwloc_base_get_topo_signature level.
>> 
>> I’m wondering if that value has been hosed, and the problem is memory 
>> corruption somewhere.
>> 
>> 
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>>> 
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>>> returning an architecture type for some reason, and I didn’t protect 
>>> against it.
>>> 
>>> 
>>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>>>> 
>>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>>> I've changed the subject line to distinguish this from the earlier report.
>>>> 
>>>> -Paul
>>>> 
>>>> program terminated by signal SEGV (no mapping at the fault address)
>>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>>> Current function is guess_strlen
>>>>    71   len += (int)strlen(sarg);
>>>> (dbx) where
>>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>>>> 0x7d93b634 
>>>> =>[2] guess_strlen(fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>>>> "printf.c"
>>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>>>> "printf.c"
>>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
It’s a fair question - that code is ancient, however, so I’m surprised it has 
only surfaced now as a problem. I can take a look at making the change.


> On Dec 12, 2014, at 5:22 PM, Paul Hargrove  wrote:
> 
> OK, applying my attached patch (based on Gilles's observation) resolved the 
> problem!
> So I fully expect Ralph's plan to use "%d" to also resolve this.
> 
> HOWEVER, while the patch catches the "%u" case, there are plenty of potential 
> ways to hit the same problem if, for instance, one uses "%zu" for size_t.  
> Additionally, I've already noted that the code for "%ld", "%lx", "%lX", "%lf" 
> are all currently incorrect.
> 
> So, I ask: "Why isn't guess_strlen() just implemented as follows?"
> 
> /* From man vsnprintf:
>  *The functions snprintf and vsnprintf do not write more  than
>  * size  bytes (including the trailing '\0').  If the output was truncated
>  * due to this limit then the return value is  the  number  of  characters
>  * (not  including the trailing '\0') which would have been written to the
>  * final string if enough space had been available. 
>  */
> static int guess_strlen(const char *fmt, va_list ap)
> { 
>   char dummy[1];
>   return 1 + vsnprintf(dummy, 1, fmt, ap);
> }
> 
> 
> BTW: I do see some messages like "select: Interrupted system call" which I 
> assume are related to the timeout code (and thus the subject of a different 
> thread).
> 
> 
> -Paul 
> 
> On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove wrote:
> Thanks, Gilles!
> 
> I was looking at that same code just now and completely missed the lack of a 
> case for '%u' (and '%lu').  I will add one now and see if that resolves the 
> problem
> 
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet wrote:
> Ralph,
> 
> I cannot find a case for the %u format in guess_strlen.
> And since the default does not invoke va_arg(),
> it seems strlen is invoked on nnuma instead of arch
> 
> Makes sense ?
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain wrote:
> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
> down there. This is at the beginning of orte_init, so there are no threads 
> running nor has anything much happened.
> 
> Do you have any suggestions?
> 
> 
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove wrote:
>> 
>> Ralph,
>> 
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134   nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>> 
>> And so is "fmt":
>> 
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>> 
>> However, things have gone bad in guess_strlen():
>> 
>> Current function is guess_strlen
>>    71   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>> 
>> -Paul
>> 
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain wrote:
>> Hmmm….this is really odd. I actually do have a protection for that arch 
>> value being NULL, and you are in the code section when it isn’t.
>> 
>> Do you still have the core file around? If so, can you print out the value 
>> of the “arch” variable? It would be in the 
>> opal_hwloc_base_get_topo_signature level.
>> 
>> I’m wondering if that value has been hosed, and the problem is memory 
>> corruption somewhere.
>> 
>> 
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain wrote:
>>> 
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>>> returning an architecture type for some reason, and I didn’t protect 
>>> against it.
>>> 
>>> 
>>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove wrote:
>>>> 
>>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>>> I've changed the subject line to distinguish this from the earlier report.
>>>> 
>>>> -Paul
>>>> 
>>>> program terminated by signal SEGV (no mapping at the fault address)
>>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>>> Current function is guess_strlen
>>>>    71   len += (int)strlen(sarg);
>>>> (dbx) where
>>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>>>> 0x7d93b634 
>>>> =>[2] guess_strlen(fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>>>> "printf.c"
>>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>>>> "printf.c"
>>>>   [4] opal_asprintf(ptr = 

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
Crud - sorry for delayed response. I was out for a bit.

I’ll just change it to %d as there is nothing magic about it being unsigned. 
How bizarre.


> On Dec 12, 2014, at 3:21 PM, Paul Hargrove  wrote:
> 
> NOTE:
> 
> The existing code for "%l." in guess_strlen() is garbage.
> The va_arg() macro calls all have "int" for the type!!
> 
> I am *only* testing a fix for the missing "%u" at the moment.
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove wrote:
> Thanks, Gilles!
> 
> I was looking at that same code just now and completely missed the lack of a 
> case for '%u' (and '%lu').  I will add one now and see if that resolves the 
> problem
> 
> 
> -Paul
> 
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet wrote:
> Ralph,
> 
> I cannot find a case for the %u format in guess_strlen.
> And since the default does not invoke va_arg(),
> it seems strlen is invoked on nnuma instead of arch
> 
> Makes sense ?
> 
> Cheers,
> 
> Gilles
> 
> Ralph Castain wrote:
> Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
> down there. This is at the beginning of orte_init, so there are no threads 
> running nor has anything much happened.
> 
> Do you have any suggestions?
> 
> 
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove wrote:
>> 
>> Ralph,
>> 
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134   nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>> 
>> And so is "fmt":
>> 
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>> 
>> However, things have gone bad in guess_strlen():
>> 
>> Current function is guess_strlen
>>    71   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>> 
>> -Paul
>> 
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain wrote:
>> Hmmm….this is really odd. I actually do have a protection for that arch 
>> value being NULL, and you are in the code section when it isn’t.
>> 
>> Do you still have the core file around? If so, can you print out the value 
>> of the “arch” variable? It would be in the 
>> opal_hwloc_base_get_topo_signature level.
>> 
>> I’m wondering if that value has been hosed, and the problem is memory 
>> corruption somewhere.
>> 
>> 
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain wrote:
>>> 
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>>> returning an architecture type for some reason, and I didn’t protect 
>>> against it.
>>> 
>>> 
>>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove wrote:
>>>> 
>>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>>> I've changed the subject line to distinguish this from the earlier report.
>>>> 
>>>> -Paul
>>>> 
>>>> program terminated by signal SEGV (no mapping at the fault address)
>>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>>> Current function is guess_strlen
>>>>    71   len += (int)strlen(sarg);
>>>> (dbx) where
>>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>>>> 0x7d93b634 
>>>> =>[2] guess_strlen(fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>>>> "printf.c"
>>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>>>> "printf.c"
>>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>>>> "printf.c"
>>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
>>>> "hwloc_base_util.c"
>>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610, 
>>>> flags = 4U), line 148 in "orte_init.c"
>>>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>>>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>>> 
>>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain wrote:
>>>> No, that looks different - it’s failing in mpirun itself. Can you get a 
>>>> line number on it?
>>>> 
>>>> Sorry for delay - I’m generating rc3 now
>>>> 
>>>> 
>>>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove wrote:
>>>>> 
>>>>> Don't see an rc3 yet.

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
NOTE:

The existing code for "%l." in guess_strlen() is garbage.
The va_arg() macro calls all have "int" for the type!!

I am *only* testing a fix for the missing "%u" at the moment.
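
To make the va_arg() point concrete, a small sketch (the helper name total_width_of_longs is hypothetical and is not taken from printf.c). An argument passed as long has to be fetched as long; fetching it as int is undefined behavior, and on an LP64 target such as 64-bit SPARC the two types also differ in size:

```c
#include <stdarg.h>
#include <stdio.h>

/* Sums the printed widths of `count` arguments that were passed as long. */
static int total_width_of_longs(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; ++i) {
        /* The type given to va_arg() must match the promoted type of the
         * argument actually passed: long here, not int. */
        long value = va_arg(ap, long);
        char tmp[32];               /* ample for a 64-bit long in decimal */

        total += snprintf(tmp, sizeof tmp, "%ld", value);
    }
    va_end(ap);
    return total;
}
```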

-Paul

On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove  wrote:

> Thanks, Gilles!
>
> I was looking at that same code just now and completely missed the lack of
> a case for '%u' (and '%lu').  I will add one now and see if that resolves
> the problem
>
>
> -Paul
>
> On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> Ralph,
>>
>> I cannot find a case for the %u format in guess_strlen.
>> And since the default does not invoke va_arg(),
>> it seems strlen is invoked on nnuma instead of arch
>>
>> Makes sense ?
>>
>> Cheers,
>>
>> Gilles
>>
>> Ralph Castain  wrote:
>> Afraid I'm drawing a blank, Paul - I can't see how we got to a bad
>> address down there. This is at the beginning of orte_init, so there are no
>> threads running nor has anything much happened.
>>
>> Do you have any suggestions?
>>
>>
>> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>>
>> Ralph,
>>
>> The "arch" variable looks fine:
>> Current function is opal_hwloc_base_get_topo_signature
>>  2134   nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>> (dbx) print arch
>> arch = 0x1001700a0 "sun4v"
>>
>> And so is "fmt":
>>
>> Current function is opal_asprintf
>>   194   length = opal_vasprintf(ptr, fmt, ap);
>> (dbx) print fmt
>> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>>
>> However, things have gone bad in guess_strlen():
>>
>> Current function is guess_strlen
>>    71   len += (int)strlen(sarg);
>> (dbx) print sarg
>> sarg = 0x2 ""
>>
>> -Paul
>>
>> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>>
>>> Hmmm... this is really odd. I actually do have a protection for that arch
>>> value being NULL, and you are in the code section when it isn't.
>>>
>>> Do you still have the core file around? If so, can you print out the
>>> value of the "arch" variable? It would be in the
>>> opal_hwloc_base_get_topo_signature level.
>>>
>>> I'm wondering if that value has been hosed, and the problem is memory
>>> corruption somewhere.
>>>
>>>
>>> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>>>
>>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc
>>> isn't returning an architecture type for some reason, and I didn't protect
>>> against it.
>>>
>>>
>>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>>>
>>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>>> I've changed the subject line to distinguish this from the earlier
>>> report.
>>>
>>> -Paul
>>>
>>> program terminated by signal SEGV (no mapping at the fault address)
>>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>>> Current function is guess_strlen
>>>    71   len += (int)strlen(sarg);
>>> (dbx) where
>>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
>>> 0x7d93b634
>>> =>[2] guess_strlen(fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
>>> "printf.c"
>>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
>>> "printf.c"
>>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
>>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
>>> "printf.c"
>>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134
>>> in "hwloc_base_util.c"
>>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610,
>>> flags = 4U), line 148 in "orte_init.c"
>>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in
>>> "orterun.c"
>>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>>
>>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>>>
>>>> No, that looks different - it's failing in mpirun itself. Can you get a
>>>> line number on it?
>>>>
>>>> Sorry for delay - I'm generating rc3 now
>>>>
>>>>
>>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>>>>
>>>> Don't see an rc3 yet.
>>>>
>>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>>> However, lacking an rc3 to test I figured it would be better to report
>>>> this than to ignore it.
>>>>
>>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
>>>> Sun compilers.
>>>>
>>>> -Paul
>>>>
>>>> [niagara1:29881] *** Process received signal ***
>>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>>> [niagara1:29881] Signal code: Address not mapped (1)
>>>> [niagara1:29881] Failing at 

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
Thanks, Gilles!

I was looking at that same code just now and completely missed the lack of
a case for '%u' (and '%lu').  I will add one now and see if that resolves
the problem


-Paul

On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Ralph,
>
> I cannot find a case for the %u format in guess_strlen.
> And since the default does not invoke va_arg(),
> it seems strlen is invoked on nnuma instead of arch
>
> Makes sense ?
>
> Cheers,
>
> Gilles
>
> Ralph Castain  wrote:
> Afraid I'm drawing a blank, Paul - I can't see how we got to a bad address
> down there. This is at the beginning of orte_init, so there are no threads
> running nor has anything much happened.
>
> Do you have any suggestions?
>
>
> On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>
> Ralph,
>
> The "arch" variable looks fine:
> Current function is opal_hwloc_base_get_topo_signature
>  2134   nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
> (dbx) print arch
> arch = 0x1001700a0 "sun4v"
>
> And so is "fmt":
>
> Current function is opal_asprintf
>   194   length = opal_vasprintf(ptr, fmt, ap);
> (dbx) print fmt
> fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>
> However, things have gone bad in guess_strlen():
>
> Current function is guess_strlen
>    71   len += (int)strlen(sarg);
> (dbx) print sarg
> sarg = 0x2 ""
>
> -Paul
>
> On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>
>> Hmmm... this is really odd. I actually do have a protection for that arch
>> value being NULL, and you are in the code section when it isn't.
>>
>> Do you still have the core file around? If so, can you print out the
>> value of the "arch" variable? It would be in the
>> opal_hwloc_base_get_topo_signature level.
>>
>> I'm wondering if that value has been hosed, and the problem is memory
>> corruption somewhere.
>>
>>
>> On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>>
>> Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn't
>> returning an architecture type for some reason, and I didn't protect
>> against it.
>>
>>
>> On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>>
>> Backtrace for the Solaris-10/SPARC SEGV appears below.
>> I've changed the subject line to distinguish this from the earlier report.
>>
>> -Paul
>>
>> program terminated by signal SEGV (no mapping at the fault address)
>> 0x7d93b634: strlen+0x0014:  lduh [%o2], %o1
>> Current function is guess_strlen
>>    71   len += (int)strlen(sarg);
>> (dbx) where
>>   [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at
>> 0x7d93b634
>> =>[2] guess_strlen(fmt = 0x7eeada98
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in
>> "printf.c"
>>   [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in
>> "printf.c"
>>   [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98
>> "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in
>> "printf.c"
>>   [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134
>> in "hwloc_base_util.c"
>>   [6] rte_init(), line 205 in "ess_hnp_module.c"
>>   [7] orte_init(pargc = 0x761c, pargv = 0x7610,
>> flags = 4U), line 148 in "orte_init.c"
>>   [8] orterun(argc = 7, argv = 0x77a8), line 856 in
>> "orterun.c"
>>   [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>>
>> On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>>
>>> No, that looks different - it's failing in mpirun itself. Can you get a
>>> line number on it?
>>>
>>> Sorry for delay - I'm generating rc3 now
>>>
>>>
>>> On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>>>
>>> Don't see an rc3 yet.
>>>
>>> My Solaris-10/SPARC runs fail slightly differently (see below).
>>> It looks sufficiently similar that it MIGHT be the same root cause.
>>> However, lacking an rc3 to test I figured it would be better to report
>>> this than to ignore it.
>>>
>>> The problem is present with both V8+ and V9 ABIs, and with both Gnu and
>>> Sun compilers.
>>>
>>> -Paul
>>>
>>> [niagara1:29881] *** Process received signal ***
>>> [niagara1:29881] Signal: Segmentation Fault (11)
>>> [niagara1:29881] Signal code: Address not mapped (1)
>>> [niagara1:29881] Failing at address: 2
>>>
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_bac
>>> ktrace_print+0x24
>>>
>>> /sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>>> /lib/libc.so.1:0xc5364
>>> /lib/libc.so.1:0xb9e64
>>> /lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>>>
>>> 

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Gilles Gouaillardet
Ralph,

I cannot find a case for the %u format in guess_strlen.
And since the default does not invoke va_arg(),
it seems strlen is invoked on nnuma instead of arch
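
The effect described above can be simulated without touching any real arguments. In this sketch the function name arg_index_read_for_s and the `known` parameter are hypothetical; it only models a scanner that calls va_arg() for the conversions it recognizes and consumes nothing in its default case:

```c
#include <string.h>

/* Returns the index of the variadic argument that a guess_strlen()-style
 * scanner would fetch for the first "%s" in fmt, or -1 if there is none.
 * `known` lists the conversion characters whose case calls va_arg(). */
static int arg_index_read_for_s(const char *fmt, const char *known)
{
    int next_arg = 0;

    for (const char *p = fmt; *p != '\0'; ++p) {
        if (*p != '%') {
            continue;
        }
        ++p;
        if (*p == '\0') {
            break;                  /* stray '%' at end of string */
        }
        if (*p == 's') {
            return next_arg;        /* this is the argument strlen() gets */
        }
        if (strchr(known, *p) != NULL) {
            ++next_arg;             /* recognized case: one va_arg() call */
        }
        /* Unrecognized conversion: default case, nothing is consumed. */
    }
    return -1;
}
```

For the format "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", a scanner that does not know 'u' reaches "%s" with argument index 0 (nnuma), while one that does know 'u' reaches it with index 7 (arch).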

Makes sense ?

Cheers,

Gilles

Ralph Castain  wrote:
>Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address 
>down there. This is at the beginning of orte_init, so there are no threads 
>running nor has anything much happened.
>
>
>Do you have any suggestions?
>
>
>
>On Dec 12, 2014, at 9:02 AM, Paul Hargrove  wrote:
>
>
>Ralph,
>
>
>The "arch" variable looks fine:
>
>Current function is opal_hwloc_base_get_topo_signature
>
> 2134                    nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch);
>
>(dbx) print arch
>
>arch = 0x1001700a0 "sun4v"
>
>
>And so is "fmt":
>
>
>Current function is opal_asprintf
>
>  194       length = opal_vasprintf(ptr, fmt, ap);
>
>(dbx) print fmt
>
>fmt = 0x7eeada98 "%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s"
>
>
>However, things have gone bad in guess_strlen():
>
>
>Current function is guess_strlen
>
>   71                       len += (int)strlen(sarg);
>
>(dbx) print sarg
>
>sarg = 0x2 ""
>
>
>-Paul
>
>
>On Fri, Dec 12, 2014 at 2:24 AM, Ralph Castain  wrote:
>
>Hmmm….this is really odd. I actually do have a protection for that arch value 
>being NULL, and you are in the code section when it isn’t.
>
>
>Do you still have the core file around? If so, can you print out the value of 
>the “arch” variable? It would be in the opal_hwloc_base_get_topo_signature 
>level.
>
>
>I’m wondering if that value has been hosed, and the problem is memory 
>corruption somewhere.
>
>
>
>On Dec 11, 2014, at 8:56 PM, Ralph Castain  wrote:
>
>
>Thanks Paul - I will post a fix for this tomorrow. Looks like Sparc isn’t 
>returning an architecture type for some reason, and I didn’t protect against 
>it.
>
>
>
>On Dec 11, 2014, at 7:39 PM, Paul Hargrove  wrote:
>
>
>Backtrace for the Solaris-10/SPARC SEGV appears below.
>
>I've changed the subject line to distinguish this from the earlier report.
>
>
>-Paul
>
>
>program terminated by signal SEGV (no mapping at the fault address)
>
>0x7d93b634: strlen+0x0014:      lduh     [%o2], %o1
>
>Current function is guess_strlen
>
>   71                       len += (int)strlen(sarg);
>
>(dbx) where
>
>  [1] strlen(0x2, 0x7300, 0x2, 0x80808080, 0x2, 0x80808080), at 
>0x7d93b634 
>
>=>[2] guess_strlen(fmt = 0x7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7058), line 71 in 
>"printf.c"
>
>  [3] opal_vasprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ap = 0x7050), line 218 in 
>"printf.c"
>
>  [4] opal_asprintf(ptr = 0x70b8, fmt = 0x7eeada98 
>"%uN:%uS:%uL3:%uL2:%uL1:%uC:%uH:%s", ... = 0x807ede0103, ...), line 194 in 
>"printf.c"
>
>  [5] opal_hwloc_base_get_topo_signature(topo = 0x100128ea0), line 2134 in 
>"hwloc_base_util.c"
>
>  [6] rte_init(), line 205 in "ess_hnp_module.c"
>
>  [7] orte_init(pargc = 0x761c, pargv = 0x7610, flags 
>= 4U), line 148 in "orte_init.c"
>
>  [8] orterun(argc = 7, argv = 0x77a8), line 856 in "orterun.c"
>
>  [9] main(argc = 7, argv = 0x77a8), line 13 in "main.c"
>
>
>On Thu, Dec 11, 2014 at 7:17 PM, Ralph Castain  wrote:
>
>No, that looks different - it’s failing in mpirun itself. Can you get a line 
>number on it?
>
>
>Sorry for delay - I’m generating rc3 now
>
>
>
>On Dec 11, 2014, at 6:59 PM, Paul Hargrove  wrote:
>
>
>Don't see an rc3 yet.
>
>
>My Solaris-10/SPARC runs fail slightly differently (see below).
>
>It looks sufficiently similar that it MIGHT be the same root cause.
>
>However, lacking an rc3 to test I figured it would be better to report this 
>than to ignore it.
>
>
>The problem is present with both V8+ and V9 ABIs, and with both Gnu and Sun 
>compilers.
>
>
>-Paul
>
>
>[niagara1:29881] *** Process received signal ***
>
>[niagara1:29881] Signal: Segmentation Fault (11)
>
>[niagara1:29881] Signal code: Address not mapped (1)
>
>[niagara1:29881] Failing at address: 2
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_backtrace_print+0x24
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:0xaa160
>
>/lib/libc.so.1:0xc5364
>
>/lib/libc.so.1:0xb9e64
>
>/lib/libc.so.1:strlen+0x14 [ Signal 11 (SEGV)]
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_vasprintf+0x20
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_asprintf+0x30
>
>/sandbox/hargrove/OMPI/openmpi-1.8.4rc2-solaris10-sparcT2-gcc346-v8plus/INST/lib/libopen-pal.so.6.2.1:opal_hwloc_base_get_topo_signature+0x24c
>