Re: [OMPI devel] [patch] async-signal-safe signal handler

2013-12-18 Thread Jeff Squyres (jsquyres)
This patch looks good to me (sorry for the delay in replying -- MPI Forum + 
OMPI dev meeting got in the way).

Brian -- do you have any opinions on it?


On Dec 11, 2013, at 1:43 AM, Kawashima, Takahiro  
wrote:

> Hi,
> 
> Open MPI's signal handler (show_stackframe function defined in
> opal/util/stacktrace.c) calls non-async-signal-safe functions
> and it causes a problem.
> 
> See attached mpisigabrt.c. Passing corrupted memory to realloc(3)
> will cause SIGABRT and show_stackframe function will be invoked.
> But invoked show_stackframe function deadlocks in backtrace_symbols(3)
> on some systems because backtrace_symbols(3) calls malloc(3)
> internally and a deadlock of realloc/malloc mutex occurs.
> 
> Attached mpisigabrt.gstack.txt shows the stacktrace gotten
> by gdb in this deadlock situation on Ubuntu 12.04 LTS (precise)
> x86_64. Though I could not reproduce this behavior on RHEL 5/6,
> I can reproduce it also on K computer and its successor PRIMEHPC FX10.
> Passing non-heap memory to free(3) and double-free also cause
> this deadlock.
> 
> malloc (and backtrace_symbols) is not marked as async-signal-safe
> in POSIX and current glibc, though it seems to have been marked
> in old glibc. So we should not call it in the signal handler now.
> 
>  
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04
>  http://cygwin.com/ml/libc-help/2013-06/msg5.html
> 
> I wrote a patch to address this issue. See the attached
> async-signal-safe-stacktrace.patch.
> 
> This patch calls backtrace_symbols_fd(3) instead of backtrace_symbols(3).
> Though backtrace_symbols_fd is not declared as async-signal-safe,
> it is described not to call malloc internally in its man. So it
> should be rather safer.
> 
> Output format of show_stackframe function is not changed by
> this patch. But the opal_backtrace_print function (backtrace
> framework) interface is changed for the output format compatibility.
> This requires changes in some additional files (ompi_mpi_abort.c
> etc.).
> 
> This patch also removes unnecessary fflush(3) calls, which are
> meaningless for write(2) system call but might cause a similar
> problem.
> 
> What do you think about this patch?
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] [patch] async-signal-safe signal handler

2013-12-11 Thread Kawashima, Takahiro
Hi,

Open MPI's signal handler (show_stackframe function defined in
opal/util/stacktrace.c) calls non-async-signal-safe functions
and it causes a problem.

See attached mpisigabrt.c. Passing corrupted memory to realloc(3)
will cause SIGABRT and show_stackframe function will be invoked.
But invoked show_stackframe function deadlocks in backtrace_symbols(3)
on some systems because backtrace_symbols(3) calls malloc(3)
internally and a deadlock of realloc/malloc mutex occurs.

Attached mpisigabrt.gstack.txt shows the stacktrace gotten
by gdb in this deadlock situation on Ubuntu 12.04 LTS (precise)
x86_64. Though I could not reproduce this behavior on RHEL 5/6,
I can reproduce it also on K computer and its successor PRIMEHPC FX10.
Passing non-heap memory to free(3) and double-free also cause
this deadlock.

malloc (and backtrace_symbols) is not marked as async-signal-safe
in POSIX and current glibc, though it seems to have been marked
in old glibc. So we should not call it in the signal handler now.

  
http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_04
  http://cygwin.com/ml/libc-help/2013-06/msg5.html

I wrote a patch to address this issue. See the attached
async-signal-safe-stacktrace.patch.

This patch calls backtrace_symbols_fd(3) instead of backtrace_symbols(3).
Though backtrace_symbols_fd is not declared as async-signal-safe,
it is described not to call malloc internally in its man. So it
should be rather safer.

Output format of show_stackframe function is not changed by
this patch. But the opal_backtrace_print function (backtrace
framework) interface is changed for the output format compatibility.
This requires changes in some additional files (ompi_mpi_abort.c
etc.).

This patch also removes unnecessary fflush(3) calls, which are
meaningless for write(2) system call but might cause a similar
problem.

What do you think about this patch?

Takahiro Kawashima,
MPI development team,
Fujitsu
Index: opal/mca/backtrace/backtrace.h
===
--- opal/mca/backtrace/backtrace.h	(revision 29841)
+++ opal/mca/backtrace/backtrace.h	(working copy)
@@ -34,11 +34,12 @@
 
 
 /*
- * print back trace to FILE file
+ * Print back trace to FILE file with a prefix for each line.
+ * First strip lines are not printed.
  *
  * \note some attempts made to be signal safe.
  */
-OPAL_DECLSPEC void opal_backtrace_print(FILE *file);
+OPAL_DECLSPEC int opal_backtrace_print(FILE *file, char *prefix, int strip);
 
 /*
  * Return back trace in buffer.  buffer will be allocated by the
Index: opal/mca/backtrace/execinfo/backtrace_execinfo.c
===
--- opal/mca/backtrace/execinfo/backtrace_execinfo.c	(revision 29841)
+++ opal/mca/backtrace/execinfo/backtrace_execinfo.c	(working copy)
@@ -20,6 +20,10 @@
 #include "opal_config.h"
 
 #include 
+#include 
+#ifdef HAVE_UNISTD_H
+#include 
+#endif
 #ifdef HAVE_EXECINFO_H
 #include 
 #endif
@@ -27,23 +31,31 @@
 #include "opal/constants.h"
 #include "opal/mca/backtrace/backtrace.h"
 
-void
-opal_backtrace_print(FILE *file)
+int
+opal_backtrace_print(FILE *file, char *prefix, int strip)
 {
-int i;
+int i, fd, len;
 int trace_size;
 void * trace[32];
-char ** messages = (char **)NULL;
+char buf[6];
 
+fd = fileno (file);
+if (-1 == fd) {
+return OPAL_ERR_BAD_PARAM;
+}
+
 trace_size = backtrace (trace, 32);
-messages = backtrace_symbols (trace, trace_size);
 
-for (i = 0; i < trace_size; i++) {
-fprintf(file, "[%d] func:%s\n", i, messages[i]);
-fflush(file);
+for (i = strip; i < trace_size; i++) {
+if (NULL != prefix) {
+write (fd, prefix, strlen (prefix));
+}
+len = snprintf (buf, sizeof(buf), "[%2d] ", i - strip);
+write (fd, buf, len);
+backtrace_symbols_fd ([i], 1, fd);
 }
 
-free(messages);
+return OPAL_SUCCESS;
 }
 
 
Index: opal/mca/backtrace/printstack/backtrace_printstack.c
===
--- opal/mca/backtrace/printstack/backtrace_printstack.c	(revision 29841)
+++ opal/mca/backtrace/printstack/backtrace_printstack.c	(working copy)
@@ -24,10 +24,12 @@
 #include "opal/constants.h"
 #include "opal/mca/backtrace/backtrace.h"
 
-void
-opal_backtrace_print(FILE *file)
+int
+opal_backtrace_print(FILE *file, char *prefix, int strip)
 {
 printstack(fileno(file));
+
+return OPAL_SUCCESS;
 }
 
 
Index: opal/mca/backtrace/none/backtrace_none.c
===
--- opal/mca/backtrace/none/backtrace_none.c	(revision 29841)
+++ opal/mca/backtrace/none/backtrace_none.c	(working copy)
@@ -23,9 +23,10 @@
 #include "opal/constants.h"
 #include "opal/mca/backtrace/backtrace.h"
 
-void
-opal_backtrace_print(FILE *file)
+int
+opal_backtrace_print(FILE *file, char *prefix, int strip)
 {
+return