The problem was subtle -- but odd nonetheless.
The error was that I was leaking open file descriptors over time (anyone who has followed this thread closely will not be surprised to hear it). Eventually either the program or the system ran out. However, the odd thing was that connect() did NOT return an error -- and I do check for that. Instead, it appears that on W2K there is some odd lazy allocation of file descriptors going on, in that I did not get an error until I actually "used" the file descriptor.
An IO::Select call is apparently NOT a use, since that worked (until it didn't.. :-). By "use" I mean either a write() or the moment data became "available" -- at that point the file descriptor turned "bad" (which in my case could happen 5-10 seconds of real time later). Perhaps it was the allocation of some sort of pseudo-tty, but that is more Unix-like and I am not sure how W2K handles sockets at that level.
In any event, something to be aware of.
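For anyone who hits the same thing, here is a minimal sketch of the kind of cleanup that avoids the leak. The helper name release_handle is hypothetical (it is not from Msg.pm or RPC.pm), and it assumes the handles are tracked in an IO::Select set as described in the original request below.

```perl
use strict;
use warnings;
use IO::Select;

# Hypothetical helper: drop a handle from the select set and close it,
# so that its descriptor goes back to the OS instead of leaking.
# Returns 1 if the close succeeded, 0 otherwise.
sub release_handle {
    my ($select, $handle) = @_;
    return 0 unless defined $handle;

    # Remove first: IO::Select looks handles up by fileno, which is
    # no longer available once the handle has been closed.
    $select->remove($handle) if $select->exists($handle);

    return close($handle) ? 1 : 0;
}
```

The idea is to call something like this every time a controlled program is torn down, before the replacement is started, so that each restart gives back what it took.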
Cheers
-Steve
On Thursday, March 13, 2003, at 09:31 AM, Steven McDowall wrote:
Sorry for the repost -- but I realized that a message with no subject wasn't about to engender much in the way of help nor tracking the thread!
---- Original Help Request ----
If anyone has some ideas here, I would really appreciate it. I've been tracking this down for over 2 weeks and am now bald! :-)
Background
------------------
I have a program that sets up a normal server socket to take connections. In general, for each client, it will then start up another program and connect to the newly started program via a socket. At some point the client will request the termination or restart of the "controlled" program -- and ask for another to be started. Over and over - for money. :-)
The program uses a pretty sophisticated event_loop pattern based on the Msg.pm/RPC.pm stuff in the Advanced Perl Programming book. At the lowest level we of course have an IO::Select call to see what to do.
The normal condition is to have 1 or 3 read_handles (1 for the master socket, 1 for the client connection, and 1 for the socket to the controlled program).
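To make the setup concrete, one pass of a loop like this can be sketched roughly as follows. This is only an illustration and not the actual Msg.pm code: the name poll_once and the callback table keyed by fileno are assumptions made for the example.

```perl
use strict;
use warnings;
use IO::Select;

# Hypothetical single pass of the event loop: wait up to $timeout
# seconds, then hand each readable handle to its callback.
# $callbacks is a hash ref keyed by fileno. Returns the number of
# handles that were serviced (0 on timeout).
sub poll_once {
    my ($select, $callbacks, $timeout) = @_;

    my @ready  = $select->can_read($timeout);
    my $served = 0;

    for my $h (@ready) {
        my $cb = $callbacks->{ fileno($h) } or next;
        $cb->($h);
        $served++;
    }
    return $served;
}
```

With the three handles described above, the select set would hold the master socket, the client connection, and the socket to the controlled program, each with its own callback.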
I am running Perl 5.6.1 :-)
The Problem
---------------------
The above method works great -- most of the time. The program can (and does) run for days on end. However, lately due to the load being sent, it is starting / ending a LOT of controlled programs in a relatively short amount of time .. like 100 in a 3 hour window. These are still all being done single-threaded (as it were) where there is no more than one controlled program or client at a time, but basically the controlled program will start -- run for a bit and crash -- causing us to clean up and restart the controlled program.
Then, 'out of the blue', I started to see a hard CPU loop -- a little debugging led to the fact that I was getting an error on the select() call -- Bad File Handle. Bad file handle I said to myself?? WTF??
Tracking the problem
---------------------------------
In the following two weeks (leading up to this) I have done a lot of debugging and am stumped.
I listed and dumped the read_handles I was passing in to select(). They all looked like the ones I was expecting.
It was NOT the first select() after creating the socket! In every instance there were at least 2-3 prior selects (that timed out) that did NOT have the Bad File Handle.
I created an "EBADF" routine -- which examined each file handle to figure out the bad boy (no other way to know which handle had the problem!). Each file handle PASSED the handle->opened() question, which indicates whether the handle is a valid handle!!!
The bad handle is ALWAYS the socket() connection to the started up program. This was verified because it failed a can_read() call with the same error.
I can ->read() on the "bad" file handle without any error. No data, but no error either.
Once in this state nothing I have tried has cleared it.
The started program is still running and sitting there pretty as a picture waiting for something to happen on the socket.
Prior to the error - I have sent data (write()) to the socket without any error either.
The program always starts on the same machine as the main program.
This is all on Windows 2K Pro.
The reason I had an endless loop is because I "ignored" the EBADF and went back to the select, which of course immediately returned the same error.. filling the log but not doing anything. That has since been fixed. :-)
The "timing" for the error (since it is NOT the first select) seems (gut feeling) to be when the socket WOULD actually have data available on the socket to be read.
It is NOT always the same # of restarts, connections.. sometimes it's 50 times, sometimes 150, sometimes 200! etc.
I tried clearing the error bit: handle->clearerr() - nada.
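For reference, the per-handle check described above can be sketched along these lines. This is a reconstruction for illustration, not the routine actually used: the name find_bad_handles is hypothetical, and it assumes that probing each handle with its own zero-timeout select() reproduces the EBADF. An opened()-style test only reflects Perl's own bookkeeping, so a probe that goes down to the OS is more likely to expose a descriptor that has gone bad underneath.

```perl
use strict;
use warnings;
use Errno qw(EBADF);

# Hypothetical helper: return every handle that is already closed from
# Perl's point of view, or that makes a select() on it alone fail
# with EBADF.
sub find_bad_handles {
    my @handles = @_;
    my @bad;

    for my $h (@handles) {
        my $fd = fileno($h);
        if (!defined $fd) {          # Perl already considers it closed
            push @bad, $h;
            next;
        }

        my $rin = '';
        vec($rin, $fd, 1) = 1;       # bit vector containing only this fd

        # Zero timeout: just poll, never block.
        my $n = select(my $rout = $rin, undef, undef, 0);
        push @bad, $h if $n < 0 && $! == EBADF;
    }
    return @bad;
}
```

Whatever this returns can then be removed from the select set and closed, rather than going straight back into select() with the same set and spinning on the same error.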
I am very confused and frustrated. Any ideas?
Thanks!
-Sj McDowall
_______________________________________________ Perl-Win32-Users mailing list [EMAIL PROTECTED] To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs