Hi Dan I noticed exactly the same symptoms. It seems to be a problem with the number of threads being allowed in the abyss web server. When your program hangs and you do a 'netstat' you will notice 16 connections locked in the FIN_WAIT_2 stage. There is a hard limit setting in abyss.h to increase this number but this doesn’t fix the root cause (merely prolonging it :D ).
Having a quick look through the code revealed some insight: It seems that to cater for asynchronous requests the server is setup to create connections and leave them open until a background thread (syn. Garbage collector) comes around and cleans up the completed transaction connections. I didn't have too much time to poke around but it seems that this bg thread either takes a long time or doesn’t do its job at all. The server then blocks until it has available connections to dispatch again. I will try and check in detail tomorrow if I have time. In the meantime I was able to stabilise my server by changing the transaction model to be foreground (sync) processes (good enough for my purposes). You can change this in xmlrpc_server.c (I think its in the RunServer2 fn). After this change - I was able to process millions of transactions without any lockups / errors. Hope this helps / adds insight into a bug fix. Cheers Jason Whenever I run a lot of xmlrpc requests, after a while the xmlrpc process becomes unresponsive. It is still there, it doesn't crash, the rest of openser works as expected, but the xmlrpc process simply doesn't answer any request anymore until restarted. The number of requests I have to issue to make this happen is variable. Sometimes is takes only 100 requests, other times is need up to 20000, but the end result is always the same. The xmlrpc request I ran is refreshWatchers(account, 'presence', 0) called repeatedly until it no longer receives a reply. The account is a subscriber that is present in the system. But I also tried pwd and ps and got the same result. Below is the strace right before and while it got stuck. As it can be seen it got stuck in futex() which seems to indicate some sort of locking mechanism that has entered a deadlock. The futex is not called before that point in the whole strace output, and when it is called it locks. --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 4353 close(9) = 0 waitpid(-1, 0xbfdcca6c, WNOHANG) = -1 ECHILD (No child processes) sigreturn() = ? (mask now []) getpeername(8, {sa_family=AF_INET, sin_port=htons(38926), sin_addr=inet_addr("10.0.0.146")}, [16]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID| SIGCHLD, child_tidptr=0xb7c69bc8) = 4354 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 accept(6, {sa_family=AF_INET, sin_port=htons(38928), sin_addr=inet_addr("10.0.0.146")}, [16]) = 9 --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 4354 close(8) = 0 waitpid(-1, 0xbfdcca6c, WNOHANG) = -1 ECHILD (No child processes) sigreturn() = ? (mask now []) getpeername(9, {sa_family=AF_INET, sin_port=htons(38928), sin_addr=inet_addr("10.0.0.146")}, [16]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID| SIGCHLD, child_tidptr=0xb7c69bc8) = 4355 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 accept(6, {sa_family=AF_INET, sin_port=htons(38930), sin_addr=inet_addr("10.0.0.146")}, [16]) = 8 --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 4355 close(9) = 0 waitpid(-1, 0xbfdcca6c, WNOHANG) = -1 ECHILD (No child processes) sigreturn() = ? (mask now []) getpeername(8, {sa_family=AF_INET, sin_port=htons(38930), sin_addr=inet_addr("10.0.0.146")}, [16]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID| SIGCHLD, child_tidptr=0xb7c69bc8) = 4356 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 accept(6, {sa_family=AF_INET, sin_port=htons(38932), sin_addr=inet_addr("10.0.0.146")}, [16]) = 9 --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 4356 close(8) = 0 waitpid(-1, 0xbfdcca6c, WNOHANG) = -1 ECHILD (No child processes) sigreturn() = ? (mask now []) getpeername(9, {sa_family=AF_INET, sin_port=htons(38932), sin_addr=inet_addr("10.0.0.146")}, [16]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID| SIGCHLD, child_tidptr=0xb7c69bc8) = 4357 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 accept(6, {sa_family=AF_INET, sin_port=htons(38934), sin_addr=inet_addr("10.0.0.146")}, [16]) = 8 --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 4357 close(9) = 0 waitpid(-1, 0xbfdcca6c, WNOHANG) = -1 ECHILD (No child processes) sigreturn() = ? (mask now []) getpeername(8, {sa_family=AF_INET, sin_port=htons(38934), sin_addr=inet_addr("10.0.0.146")}, [16]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID| SIGCHLD, child_tidptr=0xb7c69bc8) = 4358 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 accept(6, {sa_family=AF_INET, sin_port=htons(38936), sin_addr=inet_addr("10.0.0.146")}, [16]) = 9 --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 4358 close(8) = 0 waitpid(-1, 0xbfdcca6c, WNOHANG) = -1 ECHILD (No child processes) sigreturn() = ? (mask now []) getpeername(9, {sa_family=AF_INET, sin_port=htons(38936), sin_addr=inet_addr("10.0.0.146")}, [16]) = 0 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID| SIGCHLD, child_tidptr=0xb7c69bc8) = 4359 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 accept(6, {sa_family=AF_INET, sin_port=htons(38938), sin_addr=inet_addr("10.0.0.146")}, [16]) = 8 kill(4359, SIG_0) = 0 --- SIGCHLD (Child exited) @ 0 (0) --- waitpid(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG) = 4359 close(9) = 0 futex(0xb7dc9140, FUTEX_WAIT, 2, NULL -- Dan No virus found in this outgoing message. Checked by AVG. Version: 7.5.524 / Virus Database: 269.23.1/1384 - Release Date: 2008/04/17 03:47 PM _______________________________________________ Devel mailing list Devel@lists.openser.org http://lists.openser.org/cgi-bin/mailman/listinfo/devel