Hi,

>>>>
>>>>> ntdll.dll!NtWaitForMultipleObjects+0xc
>>>>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>>>>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>>>>> postgres.exe!pgwin32_recv+0x90
>>>>> postgres.exe!PgstatCollectorMain+0x17f
>>>>> postgres.exe!SubPostmasterMain+0x33a
>>>>> postgres.exe!main+0x168
>>>>> postgres.exe!__tmainCRTStartup+0x10f
>>>>> kernel32.dll!BaseProcessStart+0x23
>>>>
>>>> I have seen this problem too. The process seems stuck for no good
>>>> reason. I wondered at the time if it could be a kernel issue. I
>>>> remember trying to send some data to the collector to verify whether
>>>> it'd wake up, but no luck. (I mean I couldn't find a way to do it on
>>>> Windows).
>>>
>>> I have seen this as well, but only in cases where there has been
>>> broken firewall software or such things involved. I have seen a couple
>>> of reports from the field though.
>>>
>>> Anyway, this really is a should-never-happen thing. As soon as a new
>>> packet is sent in, WaitForMultipleObjectsEx() should return right
>>> away. And given that backends regularly send packets over, it
>>> shouldn't be an issue even if we miss one...
>>>
>>
>> And this fact should lend credence to Alvaro's (as well as mine)
>> suspicions that it seems to be a Windows kernel issue.
>>
>> As a consequence, Magnus I was wondering if having a loop similar to
>> the WRITE handling of waiting for a fixed timeout in a loop (rather
>> than an INFINITE call to WaitForMultipleObjectsEx) inside the
>> pgwin32_waitforsinglesocket() function will help for the READ case
>> too? I believe Teodor Sigaev had raised a similar concern a while back
>> about it:
>>
>> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html
>
> Maybe. I'm unsure if it's enough to just try another
> WaitForSingleObjectEx() on it, or if we need to actually issue a
> WSARecv() on it as well. Maybe it would be enough to just change the
> INFINITE on line 318 of socket.c to a fixed value. That will loop down
> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
> to do a short sleep and come back. But we'd have to change the limit
> of 5 somehow then, since in theory we should wait forever. Maybe that
> outer loop should just be a for(;;), what do you think?
>
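
To make the idea concrete, here is a rough, portable sketch of the
control flow being discussed. It is not the code in socket.c and it does
not call the Win32 API: fake_wait() and fake_recv() are hypothetical
stand-ins for WaitForMultipleObjectsEx() and WSARecv(), and the
fake_socket struct with its simulated clock is invented purely for
illustration. It only shows why a finite timeout plus an unbounded outer
loop would recover from a lost readiness event, where an INFINITE wait
would not.

```c
#include <assert.h>

/* Hypothetical stand-ins for the Win32 results; not PostgreSQL code. */
enum wait_result { WAIT_SIGNALED, WAIT_TIMED_OUT };
enum recv_result { RECV_OK, RECV_WOULD_BLOCK };

/* Simulated socket, invented for this sketch. */
struct fake_socket {
    int now;          /* simulated clock, in ticks */
    int arrival_tick; /* tick at which a packet becomes readable */
    int event_lost;   /* nonzero: the wait never reports readiness */
};

/* Stand-in for a wait with a finite timeout instead of INFINITE. */
static enum wait_result fake_wait(struct fake_socket *s, int timeout_ticks)
{
    if (!s->event_lost && s->arrival_tick <= s->now + timeout_ticks) {
        /* Readiness is reported as soon as the packet arrives. */
        if (s->arrival_tick > s->now)
            s->now = s->arrival_tick;
        return WAIT_SIGNALED;
    }
    /* No signal seen: the full timeout elapses. */
    s->now += timeout_ticks;
    return WAIT_TIMED_OUT;
}

/* Stand-in for a non-blocking receive. */
static enum recv_result fake_recv(const struct fake_socket *s)
{
    return (s->now >= s->arrival_tick) ? RECV_OK : RECV_WOULD_BLOCK;
}

/*
 * Retry forever: attempt the receive, and if it would block, wait with a
 * finite timeout and try again whether or not the wait was signaled.
 * Returns how many waits were needed before the receive succeeded.
 */
int recv_with_bounded_waits(struct fake_socket *s, int timeout_ticks)
{
    int waits = 0;

    assert(timeout_ticks > 0); /* a zero timeout would never advance */
    for (;;) {
        if (fake_recv(s) == RECV_OK)
            return waits;
        (void) fake_wait(s, timeout_ticks);
        waits++;
    }
}
```

With event_lost set, the receive still completes on the first retry
after the packet arrives, at the cost of one extra wakeup per timeout
while the socket is idle. The real code would also do its short sleep
after a would-block result, which this sketch leaves out.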

Yes, line 318 seems to be a much better location to me. If Windows and
its socket logic behave properly most of the time :), most of the calls
should come out within the first few tries, so changing 5 to an infinite
loop shouldn't hurt those normal use cases in theory.

OTOH, I was wondering: what if we kill the stats collector and, on a
restart, the socket communication resumes properly? Would that
conclusively mean that it is a flaw in our code?

Regards,
Nikhils

> From what I understand, none of you have an environment where you can
> reliably reproduce this? That means it's going to be a PITA to try to
> figure out if we're actually fixing anything :S
>
>
> --
> Magnus Hagander
> Self: http://www.hagander.net/
> Work: http://www.redpill-linpro.com/
>
--
http://www.enterprisedb.com

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs