Hi,

After debugging stalled processes from our test suite and production, I strongly suspect that the timeouts come from nss/nscd (see the backtrace with debugging symbols below):

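- GDB shows they are stuck in a libnss-pgsql2 deadlock, as described in:
http://lists.fusionforge.org/pipermail/fusionforge-general/2014-March/002631.html
In the backtrace, getpwnam() enters libnss-pgsql (frame #13), which opens a libpq connection; libpq then calls getpwuid_r(), which re-enters libnss-pgsql (frame #3) and blocks on a lock that is apparently already held by the outer call.
However, since nscd is running, the process shouldn't even enter libnss-pgsql, so the timeouts must happen when nscd randomly fails to answer.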

- GDB shows libpq looks up the requestor's UID *to locate the .pgpass file* (not to authenticate the username, since our nss-pgsql.conf specifies it explicitly). Fortunately this lookup can be bypassed by setting PGPASSFILE to an empty value:
# service unscd stop
# su admin -c id
<stalls...>
# PGPASSFILE= su admin -c id
uid=20102(admin) gid=100(users) groupes=100(users),10006(tmpl),10007(projecta),1.
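
To repeat the reproduction above without leaving a hung shell behind, the lookup can be wrapped in a time limit. This is only a minimal sketch, assuming GNU coreutils `timeout` is installed; the helper name `probe_lookup` and the 5-second limit are made up for illustration.

```shell
# Hypothetical helper (not part of FusionForge): run a command under a time
# limit and report whether it finished, failed, or stalled.
# Assumes GNU coreutils `timeout`, which exits with status 124 on timeout.
probe_lookup() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "ok"
    elif [ "$rc" -eq 124 ]; then
        echo "stalled"
    else
        echo "failed ($rc)"
    fi
}
```

With unscd stopped, `probe_lookup 5 su admin -c id` would be expected to report a stall on an affected host, and `probe_lookup 5 env PGPASSFILE= su admin -c id` to report success, matching the transcript above.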


So short of debugging unscd, and short of modifying libpq so it stops using getpw* when used from nss, we can set PGPASSFILE in the environment of the various daemons (the Apache SCM config at least, possibly ssh/shell too).
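
For a daemon started from a shell script, the workaround could look roughly like the sketch below. Where exactly to put it is an assumption on my side: on Debian, /etc/apache2/envvars is one candidate since apache2ctl sources it, but this should be checked against the local setup. The helper `pgpassfile_state` is hypothetical and only there to verify what child processes see.

```shell
# Sketch only: export an empty PGPASSFILE so that libpq, when loaded through
# libnss-pgsql, does not need the caller's home directory (and hence getpw*)
# to locate .pgpass.
# Assumed placement: a file sourced before the daemon starts, e.g.
# /etc/apache2/envvars on Debian (assumption, verify locally).
PGPASSFILE=
export PGPASSFILE

# Hypothetical check: report how a child process sees the variable.
# "set:[]" means set but empty, which is what the workaround relies on.
pgpassfile_state() {
    sh -c 'if [ "${PGPASSFILE+set}" = set ]; then
               echo "set:[$PGPASSFILE]"
           else
               echo "unset"
           fi'
}
```

The variable has to be set (even if empty) rather than unset: the `PGPASSFILE= su admin -c id` test above only works because libpq sees it in the environment.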

What do you think?

Cheers!
Sylvain

(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f69c465a4b9 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007f69c465a2e0 in __GI___pthread_mutex_lock (mutex=0x7f69b232c160 <lock>) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f69b21285d8 in _nss_pgsql_getpwuid_r (uid=1016381, result=0x7ffd398ce970, buffer=0x7ffd398ce9a0 "postfix", buflen=8192, errnop=0x7f69c53e4710) at interface.c:103
#4  0x00007f69c435dadc in __getpwuid_r (uid=1016381, resbuf=resbuf@entry=0x7ffd398ce970, buffer=buffer@entry=0x7ffd398ce9a0 "postfix", buflen=buflen@entry=8192, result=result@entry=0x7ffd398ce968) at ../nss/getXXbyYY_r.c:266
#5  0x00007f69b69e3f8a in pqGetpwuid (uid=<optimized out>, resultbuf=resultbuf@entry=0x7ffd398ce970, buffer=buffer@entry=0x7ffd398ce9a0 "postfix", buflen=buflen@entry=8192, result=result@entry=0x7ffd398ce968) at thread.c:102
#6  0x00007f69b69d0623 in pqGetHomeDirectory (buf=buf@entry=0x7ffd398d09d0 "|\022\215\071\375\177", bufsize=bufsize@entry=1024) at /build/postgresql-9.4-CPUoxi/postgresql-9.4-9.4.5/build/../src/interfaces/libpq/fe-connect.c:5863
#7  0x00007f69b69d15e4 in getPgPassFilename (pgpassfile=0x7ffd398d0fe0 "") at /build/postgresql-9.4-CPUoxi/postgresql-9.4-9.4.5/build/../src/interfaces/libpq/fe-connect.c:5813
#8  0x00007f69b69d171d in PasswordFromFile (hostname=0x7f69c65bfb40 "128.93.193.14", port=0x7f69c65ca660 "5432", dbname=0x7f69c65bfb20 "fusionforge", username=0x7f69c65bfb00 "fusionforge_nss") at /build/postgresql-9.4-CPUoxi/postgresql-9.4-9.4.5/build/../src/interfaces/libpq/fe-connect.c:5713
#9  0x00007f69b69d1b89 in connectOptions2 (conn=conn@entry=0x7f69c65c3160) at /build/postgresql-9.4-CPUoxi/postgresql-9.4-9.4.5/build/../src/interfaces/libpq/fe-connect.c:803
#10 0x00007f69b69d3918 in PQconnectStart (conninfo=0x7f69c65b6ea0 "user=fusionforge_nss dbname=fusionforge host=192.XX.XX.XX") at /build/postgresql-9.4-CPUoxi/postgresql-9.4-9.4.5/build/../src/interfaces/libpq/fe-connect.c:657
#11 0x00007f69b69d394e in PQconnectdb (conninfo=<optimized out>) at /build/postgresql-9.4-CPUoxi/postgresql-9.4-9.4.5/build/../src/interfaces/libpq/fe-connect.c:513
#12 0x00007f69b2129152 in backend_open (type=type@entry=110 'n') at backend.c:95
#13 0x00007f69b2128553 in _nss_pgsql_getpwnam_r (pwnam=0x7f69c509bb38 "xxxxx_wpj", result=0x7f69c464ce00 <resbuf>, buffer=0x7f69c60f5ff0 "postfix", buflen=1024, errnop=0x7f69c53e4710) at interface.c:85
#14 0x00007f69c435d84d in __getpwnam_r (name=name@entry=0x7f69c509bb38 "xxxxx_wpj", resbuf=resbuf@entry=0x7f69c464ce00 <resbuf>, buffer=0x7f69c60f5ff0 "postfix", buflen=1024, result=result@entry=0x7ffd398d1558) at ../nss/getXXbyYY_r.c:266
#15 0x00007f69c435d1df in getpwnam (name=0x7f69c509bb38 "xxxxx_wpj") at ../nss/getXXbyYY.c:116
#16 0x00007f69c11a5531 in ?? () from /usr/lib/apache2/modules/mpm_itk.so
#17 0x00007f69c51aac00 in ap_run_post_perdir_config (r=0x7f69c50990a0) at request.c:96
#18 0x00007f69c51ad0c8 in ap_process_request_internal (r=0x7f69c50990a0) at request.c:237
#19 0x00007f69c51ca670 in ap_process_async_request (r=0x7f69c50990a0) at http_request.c:315
#20 0x00007f69c51ca820 in ap_process_request (r=0x7f69c50990a0) at http_request.c:363
#21 0x00007f69c51c7122 in ap_process_http_sync_connection (c=0x7f69c50a1290) at http_core.c:190
#22 ap_process_http_connection (c=0x7f69c50a1290) at http_core.c:231
#23 0x00007f69c51bdb10 in ap_run_process_connection (c=0x7f69c50a1290) at connection.c:41
#24 0x00007f69c11a5adb in itk_fork_process () from /usr/lib/apache2/modules/mpm_itk.so
#25 0x00007f69c51bdb10 in ap_run_process_connection (c=0x7f69c50a1290) at connection.c:41
#26 0x00007f69c0b937ba in child_main (child_num_arg=-1305296544) at prefork.c:704
#27 0x00007f69c0b93a01 in make_child (s=0x7f69c53b0de0, slot=5) at prefork.c:800
#28 0x00007f69c0b94667 in perform_idle_server_maintenance (p=<optimized out>) at prefork.c:902
#29 prefork_run (_pconf=0x7f69c53faf38 <ap_server_conf>, plog=0x7ffd398d18ac, s=0x7ffd398d18b0) at prefork.c:1090
#30 0x00007f69c5199e7e in ap_run_mpm (pconf=0x7f69c53e2028, plog=0x7f69c53b6028, s=0x7f69c53b0de0) at mpm_common.c:94
#31 0x00007f69c51933c3 in main (argc=3, argv=0x7ffd398d1b98) at main.c:777
_______________________________________________
Fusionforge-general mailing list
Fusionforge-general@lists.fusionforge.org
http://lists.fusionforge.org/cgi-bin/mailman/listinfo/fusionforge-general
