On Tue, Feb 2, 2010 at 3:17 PM, hj lee <[email protected]> wrote: > >> Thank you very much for moving forward this issue. The stacktrace I posted >> was created with timestamd off. The strftime was called by syslog() call. >> Inside syslog, it calls strftime to get the time. Now I have a clear story >> how this seg fault happens. It happens very rarely, not easy to reproduce. >> >> As I I guessed, this seg fault was caused by setenv() which is called >> pcmk_startup(). These are those calls grepped from pacemaker source. >> 5 plugin.c 328 setenv("HA_debug", "1", 1); >> 6 plugin.c 332 setenv("HA_debug", "0", 1); >> 7 plugin.c 349 setenv("HA_logfile", value, 1); >> 8 plugin.c 354 setenv("HA_logfacility", value, 1); >> 9 plugin.c 355 setenv("HA_LOGFACILITY", value, 1); >> a plugin.c 374 setenv("HA_use_logd", value, 1); >> b plugin.c 514 setenv("HA_COMPRESSION", "bz2", 1); >> c plugin.c 515 setenv("HA_cluster_type", "openais", 1); >> >> >> The setenv() in glibc does realloc current environment variable >> (__environ) when new environment variable is added. The pcmk is adding 8 new >> variables while it is initialized. >> >> Here are a part of setenv() code from >> glibc-debuginfo-common-2.5-34.i386.rpm. The last_environ is a copy of >> __environ. >> 141 /* We allocated this space; we can extend it. */ >> 142 new_environ = (char **) realloc (last_environ, >> 143 (size + 2) * sizeof (char *)); >> >> >> The getenv() of glibc library copies __environ variable and loop it to >> find "TZ" environment variable. Here is the getenv() code. >> >> 81 for (ep = __environ; *ep != NULL; ++ep) >> 82 { >> 83 #if _STRING_ARCH_unaligned >> 84 uint16_t ep_start = *(uint16_t *) *ep; >> 85 #else >> 86 uint16_t ep_start = (((unsigned char *) *ep)[0] >> 87 | (((unsigned char *) *ep)[1] << 8)); >> 88 #endif >> 89 >> 90 if (name_start == ep_start && !strncmp (*ep + 2, name, len) >> 91 && (*ep)[len + 2] == '=') >> 92 return &(*ep)[len + 3]; >> 93 } >> >> Here is very top of my stacktrace: >> (gdb) bt >> >> #0 0x0013b7ee in getenv (name=<value optimized out>) at getenv.c:90 >> #1 0x00193cf0 in tzset_internal (always=<value optimized out>, >> explicit=<value optimized out>) at tzset.c:162 >> #2 0x0019480d in __tzset () at tzset.c:543 >> >> The seg fault exactly happens when getenv() is scheduled out between line >> 82 - 89, and setenv() changes __environ variable, and getenv() is scheduled >> again. Then getenv() has the old pointer that was freed by setenv(). Seg >> fault! >> >> So I stongely suggest two proposals I posted eariler. >> 1. Delay creating logsys thread after pcmk_startup() call >> 2. Remove most(hopefuly all of them) of setenv() in pcmk_startup(). If >> they are needed, then administrator can export it in the shell. >> >> Because setenv() is not a safe code as you can see, it should not be used. >> >> Thank very much. >> hj >> > > Hi, > > Obviously this is a glibc issue. The corosync and pacemaker did nothing > wrong. The two proposal are just work around, not fixing the root cause in > glibc. The glibc can fix this issue by adding LOCK in getenv() function. The > setenv() has LOCK/UNLOCK on beginning and on exiting. Add the same lock to > getenv() will prevent this issue. I am not sure how long it will take to fix > glibc or how hard. but in the meantime, corosync/pacemaker can use wrapper > functions for get/setenv() which will lock for get/setenv() with pthread > mutex, so that only one thread can get into get/setenv(). This may be a > better way to address this issue in corosync/pacemaker. > > I want to hear your opinion since this is one of release blocking issue. > Thank very much >
Hi, Think again. Sorry, the wrapper won't work. Because getenv() is called inside from syslog(). Thanks
_______________________________________________ Openais mailing list [email protected] https://lists.linux-foundation.org/mailman/listinfo/openais
