> > Thank you for you info. I looked the diff. But still I think my seg
> > fault is not fixed by that patch. I turned off timestamp in
> > corosync.conf, so strftime is never got called in my test.
> >
> > Thanks
> > hj
> >
> >
> Thanks for your persistence.  The original backtrace you reported is a
> direct result of timestamp:on and the aforementioned patch (stack frame
> #3 calls strftime which only executes with timestamp: on).  You
> mentioned it still crashed with timestamps turned off so I took further
> investigation of the code.  According to pthreads man page,
> getenv/setenv are not thread safe either.  It is possible there is some
> thread related issue with pacemaker integration and the getenv calls
> within corosync.  Could you provide a backtrace of your crash with
> timestamp set to off?  Seems to work in my environment.
>
> Regards
> -steve
>

Hi again,

Thank you very much for moving forward this issue. The stacktrace I posted
was created with timestamd off. The strftime was called by syslog() call.
Inside syslog, it calls strftime to get the time. Now I have a clear story
how this seg fault happens. It happens very rarely, not easy to reproduce.

As I I guessed, this seg fault was caused by setenv() which is called
pcmk_startup(). These are those calls grepped from pacemaker source.
5 plugin.c        328 setenv("HA_debug",  "1", 1);
6 plugin.c        332 setenv("HA_debug",  "0", 1);
7 plugin.c        349 setenv("HA_logfile",  value, 1);
8 plugin.c        354 setenv("HA_logfacility",  value, 1);
9 plugin.c        355 setenv("HA_LOGFACILITY",  value, 1);
a plugin.c        374 setenv("HA_use_logd", value, 1);
b plugin.c        514 setenv("HA_COMPRESSION",  "bz2", 1);
c plugin.c        515 setenv("HA_cluster_type", "openais", 1);


The setenv() in glibc does realloc current environment variable (__environ)
when new environment variable is added. The pcmk is adding 8 new variables
while it is initialized.

Here are a part of setenv() code from
glibc-debuginfo-common-2.5-34.i386.rpm. The last_environ is a copy of
__environ.
141       /* We allocated this space; we can extend it.  */
142       new_environ = (char **) realloc (last_environ,
143                                        (size + 2) * sizeof (char *));


The getenv() of glibc library copies __environ variable and loop it to find
"TZ" environment variable. Here is the getenv() code.
 81       for (ep = __environ; *ep != NULL; ++ep)
 82         {
 83 #if _STRING_ARCH_unaligned
 84           uint16_t ep_start = *(uint16_t *) *ep;
 85 #else
 86           uint16_t ep_start = (((unsigned char *) *ep)[0]
 87                                | (((unsigned char *) *ep)[1] << 8));
 88 #endif
 89
 90           if (name_start == ep_start && !strncmp (*ep + 2, name, len)
 91               && (*ep)[len + 2] == '=')
 92             return &(*ep)[len + 3];
 93         }

Here is very top of my stacktrace:
(gdb) bt
#0  0x0013b7ee in getenv (name=<value optimized out>) at getenv.c:90
#1  0x00193cf0 in tzset_internal (always=<value optimized out>,
explicit=<value optimized out>) at tzset.c:162
#2  0x0019480d in __tzset () at tzset.c:543

The seg fault exactly happens when getenv() is scheduled out between line 82
- 89, and setenv() changes __environ variable, and getenv() is scheduled
again. Then getenv() has the old pointer that was freed by setenv(). Seg
fault!

So I stongely suggest two proposals I posted eariler.
1. Delay creating logsys thread after pcmk_startup() call
2. Remove most(hopefuly all of them) of setenv() in pcmk_startup(). If they
are needed, then administrator can export it in the shell.

Because setenv() is not a safe code as you can see, it should not be used.

Thank very much.
hj
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais

Reply via email to