> > Thank you for you info. I looked the diff. But still I think my seg
> > fault is not fixed by that patch. I turned off timestamp in
> > corosync.conf, so strftime is never got called in my test.
> >
> > Thanks
> > hj
> >
> >
> Thanks for your persistence. The original backtrace you reported is a
> direct result of timestamp:on and the aforementioned patch (stack frame
> #3 calls strftime which only executes with timestamp: on). You
> mentioned it still crashed with timestamps turned off so I took further
> investigation of the code. According to pthreads man page,
> getenv/setenv are not thread safe either. It is possible there is some
> thread related issue with pacemaker integration and the getenv calls
> within corosync. Could you provide a backtrace of your crash with
> timestamp set to off? Seems to work in my environment.
>
> Regards
> -steve
>
Hi again,
Thank you very much for moving forward this issue. The stacktrace I posted
was created with timestamd off. The strftime was called by syslog() call.
Inside syslog, it calls strftime to get the time. Now I have a clear story
how this seg fault happens. It happens very rarely, not easy to reproduce.
As I I guessed, this seg fault was caused by setenv() which is called
pcmk_startup(). These are those calls grepped from pacemaker source.
5 plugin.c 328 setenv("HA_debug", "1", 1);
6 plugin.c 332 setenv("HA_debug", "0", 1);
7 plugin.c 349 setenv("HA_logfile", value, 1);
8 plugin.c 354 setenv("HA_logfacility", value, 1);
9 plugin.c 355 setenv("HA_LOGFACILITY", value, 1);
a plugin.c 374 setenv("HA_use_logd", value, 1);
b plugin.c 514 setenv("HA_COMPRESSION", "bz2", 1);
c plugin.c 515 setenv("HA_cluster_type", "openais", 1);
The setenv() in glibc does realloc current environment variable (__environ)
when new environment variable is added. The pcmk is adding 8 new variables
while it is initialized.
Here are a part of setenv() code from
glibc-debuginfo-common-2.5-34.i386.rpm. The last_environ is a copy of
__environ.
141 /* We allocated this space; we can extend it. */
142 new_environ = (char **) realloc (last_environ,
143 (size + 2) * sizeof (char *));
The getenv() of glibc library copies __environ variable and loop it to find
"TZ" environment variable. Here is the getenv() code.
81 for (ep = __environ; *ep != NULL; ++ep)
82 {
83 #if _STRING_ARCH_unaligned
84 uint16_t ep_start = *(uint16_t *) *ep;
85 #else
86 uint16_t ep_start = (((unsigned char *) *ep)[0]
87 | (((unsigned char *) *ep)[1] << 8));
88 #endif
89
90 if (name_start == ep_start && !strncmp (*ep + 2, name, len)
91 && (*ep)[len + 2] == '=')
92 return &(*ep)[len + 3];
93 }
Here is very top of my stacktrace:
(gdb) bt
#0 0x0013b7ee in getenv (name=<value optimized out>) at getenv.c:90
#1 0x00193cf0 in tzset_internal (always=<value optimized out>,
explicit=<value optimized out>) at tzset.c:162
#2 0x0019480d in __tzset () at tzset.c:543
The seg fault exactly happens when getenv() is scheduled out between line 82
- 89, and setenv() changes __environ variable, and getenv() is scheduled
again. Then getenv() has the old pointer that was freed by setenv(). Seg
fault!
So I stongely suggest two proposals I posted eariler.
1. Delay creating logsys thread after pcmk_startup() call
2. Remove most(hopefuly all of them) of setenv() in pcmk_startup(). If they
are needed, then administrator can export it in the shell.
Because setenv() is not a safe code as you can see, it should not be used.
Thank very much.
hj
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais