>
>
> Thank you very much for moving forward this issue. The stacktrace I posted
> was created with timestamd off. The strftime was called by syslog() call.
> Inside syslog, it calls strftime to get the time. Now I have a clear story
> how this seg fault happens. It happens very rarely, not easy to reproduce.
>
> As I I guessed, this seg fault was caused by setenv() which is called
> pcmk_startup(). These are those calls grepped from pacemaker source.
> 5 plugin.c        328 setenv("HA_debug",  "1", 1);
> 6 plugin.c        332 setenv("HA_debug",  "0", 1);
> 7 plugin.c        349 setenv("HA_logfile",  value, 1);
> 8 plugin.c        354 setenv("HA_logfacility",  value, 1);
> 9 plugin.c        355 setenv("HA_LOGFACILITY",  value, 1);
> a plugin.c        374 setenv("HA_use_logd", value, 1);
> b plugin.c        514 setenv("HA_COMPRESSION",  "bz2", 1);
> c plugin.c        515 setenv("HA_cluster_type", "openais", 1);
>
>
> The setenv() in glibc does realloc current environment variable (__environ)
> when new environment variable is added. The pcmk is adding 8 new variables
> while it is initialized.
>
> Here are a part of setenv() code from
> glibc-debuginfo-common-2.5-34.i386.rpm. The last_environ is a copy of
> __environ.
> 141       /* We allocated this space; we can extend it.  */
> 142       new_environ = (char **) realloc (last_environ,
> 143                                        (size + 2) * sizeof (char *));
>
>
> The getenv() of glibc library copies __environ variable and loop it to find
> "TZ" environment variable. Here is the getenv() code.
>
>  81       for (ep = __environ; *ep != NULL; ++ep)
>  82         {
>  83 #if _STRING_ARCH_unaligned
>  84           uint16_t ep_start = *(uint16_t *) *ep;
>  85 #else
>  86           uint16_t ep_start = (((unsigned char *) *ep)[0]
>  87                                | (((unsigned char *) *ep)[1] << 8));
>  88 #endif
>  89
>  90           if (name_start == ep_start && !strncmp (*ep + 2, name, len)
>  91               && (*ep)[len + 2] == '=')
>  92             return &(*ep)[len + 3];
>  93         }
>
> Here is very top of my stacktrace:
> (gdb) bt
>
> #0  0x0013b7ee in getenv (name=<value optimized out>) at getenv.c:90
> #1  0x00193cf0 in tzset_internal (always=<value optimized out>,
> explicit=<value optimized out>) at tzset.c:162
> #2  0x0019480d in __tzset () at tzset.c:543
>
> The seg fault exactly happens when getenv() is scheduled out between line
> 82 - 89, and setenv() changes __environ variable, and getenv() is scheduled
> again. Then getenv() has the old pointer that was freed by setenv(). Seg
> fault!
>
> So I stongely suggest two proposals I posted eariler.
> 1. Delay creating logsys thread after pcmk_startup() call
> 2. Remove most(hopefuly all of them) of setenv() in pcmk_startup(). If they
> are needed, then administrator can export it in the shell.
>
> Because setenv() is not a safe code as you can see, it should not be used.
>
> Thank very much.
> hj
>

Hi,

Obviously this is a glibc issue. The corosync and pacemaker did nothing
wrong.  The two proposal are just work around, not fixing the root cause in
glibc. The glibc can fix this issue by adding LOCK in getenv() function. The
setenv() has LOCK/UNLOCK on beginning and on exiting. Add the same lock to
getenv() will prevent this issue. I am not sure how long it will take to fix
glibc or how hard. but in the meantime, corosync/pacemaker can use wrapper
functions for get/setenv() which will lock for get/setenv() with pthread
mutex, so that only one thread can get into get/setenv(). This may be a
better way to address this issue in corosync/pacemaker.

I want to hear your opinion since this is one of release blocking issue.
Thank very much
_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais

Reply via email to