Daniel,

I have committed the fix to CVS HEAD. Could you try it out?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Thanks to you Tatsuo,...
>
> I am glad my report helped you! As soon as you tell me it is done I will
> start testing/debugging it again.
>
> 2010/3/2 Tatsuo Ishii <[email protected]>
>
> > Daniel,
> >
> > Thanks for the report!
> >
> > > I spoke in another message about this problem, yet I debugged deeper
> > > and I have more specific information that may be useful.
> > > (The thread I mentioned was:
> > > http://lists.pgfoundry.org/pipermail/pgpool-general/2010-February/002565.html
> > > )
> > >
> > > I am working with two VB Virtual machines with CentOS 5 (i386). Running
> > > PostgreSQL 8.3.9 and pgpool 2.3.2.1.
> > >
> > > The test was simple. While I was inserting values every second, I
> > > unplugged one of the nodes.
> > > The health check runs every second and its timeout is 2 seconds.
> > >
> > > At that moment all inserts stop, and pgpool waits.
> > > The point where it stops is:
> > >
> > > [...]
> > > [pid 29444] 10:47:55.537470 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > > [pid 29444] 10:47:55.537591 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> > > [pid 29444] 10:47:55.537726 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > > [pid 29444] 10:47:55.537886 connect(9, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16) = ? ERESTARTSYS (To be restarted)
> > > [pid 29444] 10:47:56.529113 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > [pid 29444] 10:47:56.529235 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
> > > [pid 29444] 10:47:56.529428 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > > [pid 29444] 10:47:56.529602 sigreturn() = ? (mask now [])
> > > [pid 29444] 10:47:56.529894 connect(9, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16 <unfinished ...>
> > >
> > > First it does a connect() which receives the SIGALRM, and continues.
> > > But then it does another connect(), and this time it does not receive
> > > any SIGALRM, so it waits (I think) until the system closes the
> > > connection.
> > >
> > > After waiting (too long) it starts working again (now with the node
> > > down):
> > >
> > > [...]
> > > [pid 29445] 10:49:30.273727 <... connect resumed> ) = -1 EHOSTUNREACH (No route to host)
> > > [pid 29444] 10:49:30.274739 <... connect resumed> ) = -1 EHOSTUNREACH (No route to host)
> > > [pid 29445] 10:49:30.274809 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > [pid 29445] 10:49:30.275057 time(NULL) = 1267436970
> > > [pid 29445] 10:49:30.275202 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2593, ...}) = 0
> > > [pid 29445] 10:49:30.275485 write(2, "2010-03-01 10:49:30 ERROR: pid 2"..., 1012010-03-01 10:49:30 ERROR: pid 29445: connect_inet_domain_socket: connect() failed: No route to host
> > > ) = 101
> > > [pid 29445] 10:49:30.275911 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > > [pid 29445] 10:49:30.276062 close(7) = 0
> > > [pid 29445] 10:49:30.276221 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > [pid 29445] 10:49:30.276389 time(NULL) = 1267436970
> > > [pid 29445] 10:49:30.276715 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2593, ...}) = 0
> > > [pid 29445] 10:49:30.276895 write(2, "2010-03-01 10:49:30 ERROR: pid 2"..., 782010-03-01 10:49:30 ERROR: pid 29445: connection to 192.168.1.10(5432) failed
> > > ) = 78
> > > [...]
> > >
> > > As you can see, it restarts after a minute and a half (which is too
> > > much). It is always the same (without changing any system values).
> > >
> > > If it is necessary I can show more debug lines.
> > >
> > > Looking through the source, we think it could be a problem with the
> > > connection being blocked. Maybe it would be possible not to block it
> > > (speaking about the socket).
> > > We suppose something is happening in pool_connection_pool.c around
> > > line 473 ("connect_inet_domain_socket_by_port").
> > >
> > > Or maybe I am doing something wrong,... has anybody else tested the
> > > "unplugged wire"? Is it working?
> >
> > What health_check() does here is:
> >
> > start alarm (done by caller of health_check)
> > connect()
> > write()
> > read()
> > :
> > :
> >
> > If the wire is unplugged, one of the system calls will block, and
> > eventually the alarm interrupts whichever of connect/write/read is
> > pending, and health_check returns with an error code. write() and
> > read() are fine. The problem is that connect is done by
> > connect_inet_domain_socket_by_port, which retries if connect() is
> > interrupted by a signal.
> >
> > I believe that is what you saw.
> > > First it does a connect() which receives the SIGALRM, and continues.
> > > But then it does another connect(), and this time it does not receive
> > > any SIGALRM, so it waits (I think) until the system closes the
> > > connection.
> >
> > The retry should be turned off if it's called from health_check(). Will
> > fix.
> >
> > Thanks again for good testing and analysis.
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> > English: http://www.sraoss.co.jp/index_en.php
> > Japanese: http://www.sraoss.co.jp
> >
_______________________________________________
Pgpool-general mailing list
[email protected]
http://pgfoundry.org/mailman/listinfo/pgpool-general
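
The retry behaviour discussed in this thread can be sketched in C. This is
only an illustration under stated assumptions, not pgpool's actual code: the
names connect_maybe_retry, connect_fn and fake_interrupted_connect are
hypothetical, and the connect call is passed in as a function pointer so that
a stub can simulate an interruption by SIGALRM without a real network.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical type with the same shape as connect(2), so a stub can
 * stand in for the real call. */
typedef int (*connect_fn)(int fd, const struct sockaddr *addr, socklen_t len);

/*
 * Hypothetical helper (not taken from pgpool): call do_connect() and retry
 * after EINTR only when retry_on_eintr is true. A health-check caller would
 * pass false, so that the alarm that interrupted the call is not followed
 * by a second, unguarded connect attempt.
 * Returns 0 on success, -1 on failure with errno as set by do_connect().
 */
int connect_maybe_retry(connect_fn do_connect, int fd,
                        const struct sockaddr *addr, socklen_t len,
                        bool retry_on_eintr)
{
    for (;;) {
        if (do_connect(fd, addr, len) == 0)
            return 0;
        if (errno == EINTR && retry_on_eintr)
            continue;           /* interrupted by a signal: try again */
        return -1;              /* report the failure to the caller */
    }
}

/* Demonstration stub: fails with EINTR on the first call, then succeeds. */
static int fake_calls = 0;

static int fake_interrupted_connect(int fd, const struct sockaddr *addr,
                                    socklen_t len)
{
    (void)fd;
    (void)addr;
    (void)len;
    if (++fake_calls == 1) {
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

With retry_on_eintr set to true the stub is called twice and the helper
returns 0. With it set to false the helper returns -1 after a single call,
with errno still EINTR, so a health-check caller can treat the interrupted
connect as a failed check instead of blocking in a second connect().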
