Re: PR 15282 AcceptEx problem

2003-03-03 Thread Allan Edwards
William A. Rowe, Jr. wrote:
> > Just to summarize, there are three conditions we need to consider:
> > 1) we hit the TransmitFile recycle bug many times in a row
> > 2) we have encountered an incompatible firewall or VPN
> > 3) the IP address has changed
>
> You seem to have the failcases easily reproduced.  Would you tack in
> some quick code that simply uses getsockopt(foo) (any option you like)
> to see if simply getting socket options for a now-broken listen socket
> will fail?

Actually I have not been able to reproduce the AcceptEx error
for 3), however I think the following will address all three
cases and introduces the WindowsSocketsWorkaround directive:

Index: mpm/winnt/child.c
===
RCS file: /home/cvs/httpd-2.0/server/mpm/winnt/child.c,v
retrieving revision 1.13
diff -u -d -b -r1.13 child.c
--- mpm/winnt/child.c   28 Feb 2003 14:02:42 -  1.13
+++ mpm/winnt/child.c   3 Mar 2003 22:31:15 -
@@ -498,7 +498,7 @@
 PCOMP_CONTEXT context = NULL;
 DWORD BytesRead;
 SOCKET nlsd;
-int rv;
+int rv, err_count = 0;
 apr_os_sock_get(&nlsd, lr->sd);

@@ -538,15 +538,38 @@
 rv = apr_get_netos_error();
 if ((rv == APR_FROM_OS_ERROR(WSAEINVAL)) ||
 (rv == APR_FROM_OS_ERROR(WSAENOTSOCK))) {
-/* Hack alert. Occasionally, TransmitFile will not recycle the
- * accept socket (usually when the client disconnects early).
- * Get a new socket and try the call again.
+/* Hack alert, we can get here because:
+ * 1) Occasionally, TransmitFile will not recycle the accept socket
+ *(usually when the client disconnects early).
+ * 2) There is VPN or Firewall software installed with buggy AcceptEx implementation
+ * 3) The webserver is using a dynamic address and it has changed
  */
+Sleep(0);
+if (++err_count > 1000) {
+apr_int32_t disconnected;
+
+/* arbitrary socket call to test if the Listening socket is still valid */
+apr_status_t listen_rv = apr_socket_opt_get(lr->sd, APR_SO_DISCONNECTED, &disconnected);
+
+if (listen_rv == APR_SUCCESS) {
+ap_log_error(APLOG_MARK,APLOG_ERR, listen_rv, ap_server_conf,
+ "AcceptEx error: If this occurs constantly and NO requests are being served "
+ "try using the WindowsSocketsWorkaround directive set to 'on'.");
+err_count = 0;
+}
+else {
+ap_log_error(APLOG_MARK,APLOG_ERR, listen_rv, ap_server_conf,
+ "The Listening socket is no longer valid. Dynamic address changed?");
+break;
+}
+}
+
 closesocket(context->accept_socket);
 context->accept_socket = INVALID_SOCKET;
 ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
-   "winnt_accept: AcceptEx failed due to early client "
-   "disconnect. Reallocate the accept socket and try again.");
+   "winnt_accept: AcceptEx failed, either early client disconnect, "
+   "dynamic address renewal, or incompatible VPN or Firewall software.");
+
 continue;
 }
 else if ((rv != APR_FROM_OS_ERROR(ERROR_IO_PENDING)) &&
@@ -558,6 +581,7 @@
 Sleep(100);
 continue;
 }
+err_count = 0;
 /* Wait for pending i/o.
  * Wake up once per second to check for shutdown .
@@ -701,7 +725,7 @@
 ap_update_child_status_from_indexes(0, thread_num, SERVER_READY, NULL);
 /* Grab a connection off the network */
-if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS) {
+if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS || windows_sockets_workaround == 1) {
 context = win9x_get_connection(context);
 }
 else {
@@ -769,7 +793,7 @@
 static void create_listener_thread()
 {
 int tid;
-if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS) {
+if (osver.dwPlatformId == VER_PLATFORM_WIN32_WINDOWS || windows_sockets_workaround == 1) {
 _beginthreadex(NULL, 0, (LPTHREAD_START_ROUTINE) win9x_accept,
 NULL, 0, &tid);
 } else {
@@ -840,7 +864,7 @@
  * Create the worker thread dispatch IOCompletionPort
  * on Windows NT/2000
  */
-if (osver.dwPlatformId != VER_PLATFORM_WIN32_WINDOWS) {
+if (osver.dwPlatformId != VER_PLATFORM_WIN32_WINDOWS && windows_sockets_workaround != 1) {
 /* Create the worker thread dispatch IOCP */
 ThreadDispatchIOCP = 

Re: PR 15282 AcceptEx problem

2003-02-28 Thread William A. Rowe, Jr.
This patch can't be applied... it actually introduces a denial of service
problem if folks can simply early-disconnect a server some half dozen
times in a row...  It isn't hard to work up such a tool.

Better; what if we test *which* socket failed.  We are sort of helpless 
when the errors could be on either the Listen or the Accept socket.  If the
error is on the Listen socket, we should exit signaling the parent to do
a restart with new listeners, if the error is on the accept socket we can
just keep going.

More thoughts inline...

At 01:34 PM 2/27/2003, Allan Edwards wrote:
> As far as I can tell this is a bug in the Sprint
> PCS Connect support for AcceptEx, (they install a
> Winsock transport provider called BMI). However, it slips
> through our checks and causes the accept thread to
> hard loop and consume most of the cpu.

Instead, can we find some patch that will test AcceptEx?  Perhaps we 
create a single local listen and attempt to connect and write to it, test
that the AcceptEx succeeds, and otherwise emit some nasty warnings
and throw a flag that puts us into the Win9x listener code?

> What happens is that in get_listeners_from_parent()
> WSASocket *succeeds* using the WSAProtocolInfo from
> the parent; however, AcceptEx in winnt_accept() fails
> with WSAENOTSOCK.

Does accept() also fail?  Can we use the 9x code to work around these
sorts of problems?

> I don't see what we can do to fix this but we should
> at least avoid hogging the cpu and log an informative
> message. Unless there is a better idea I'll commit to 2.1

I don't as much mind the Sleep(100) or even Sleep(0) so that we
relinquish clock cycles.  It's the arbitrary "foil the server 100 times
and it will exit" problem.

> 16327 may be related but I haven't been able to recreate
> the problem with BlackIce or Norton Personal Firewall.

Because those only occur once the listen socket becomes
invalidated, due to DHCP or some other change.  You can trigger it
by reconfiguring TCP/IP to switch between two IP addresses.
Again, we can recover gracefully if we ask the parent to do 
a respawn upon recreating all of *its* listeners.

If the parent can test that the listeners are healthy (with some
simple setsockopt call) then we can just leave it to the child to
exit.  As long as the parent performs a listener health check
before each child process spawn, we should be much better off
than we are today.

Bill 
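
A minimal sketch of the listener health check described above, written against POSIX sockets as a stand-in for Winsock so it can be run anywhere. The helper names `listener_is_healthy` and `all_listeners_healthy` are hypothetical, and SO_TYPE is just an arbitrary side-effect-free option. Whether a listener invalidated by an address change actually fails such a call is the open question in this thread; the sketch only shows the shape of the probe.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the descriptor still answers a trivial getsockopt() call,
 * 0 otherwise (for example, when the descriptor is no longer a socket). */
static int listener_is_healthy(int fd)
{
    int type = 0;
    socklen_t len = sizeof(type);

    return getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == 0;
}

/* Hypothetical parent-side check, meant to run before each child spawn.
 * Returns 1 only if every listener passes the probe. */
static int all_listeners_healthy(const int *fds, int count)
{
    int i;

    assert(fds != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (!listener_is_healthy(fds[i])) {
            return 0;   /* caller should recreate its listeners first */
        }
    }
    return 1;
}
```

If the probe fails, the parent would recreate its listeners before spawning the next child; if it passes, the child can simply be respawned with the existing ones.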



Re: PR 15282 AcceptEx problem

2003-02-28 Thread Allan Edwards
William A. Rowe, Jr. wrote:
> This patch can't be applied... it actually introduces a denial of service
> problem if folks can simply early-disconnect a server some half dozen

actually 100  :)

> times in a row...  It isn't hard to work up such a tool.

If it is possible for someone to externally tickle the TransmitFile socket 
recycle bug then I agree.

> Better; what if we test *which* socket failed.  We are sort of helpless 
> when the errors could be on either the Listen or the Accept socket.  If the
> error is on the Listen socket, we should exit signaling the parent to do
> a restart with new listeners, if the error is on the accept socket we can
> just keep going.

Based on the IP address renewal scenario you mention below, testing the Listen
socket (somehow, tbd) sounds like a good idea.

Just to summarize, there are three conditions we need to consider:
1) we hit the TransmitFile recycle bug many times in a row
2) we have encountered an incompatible firewall or VPN
3) the IP address has changed

> Instead, can we find some patch that will test AcceptEx?  Perhaps we 
> create a single local listen and attempt to connect and write to it, test
> that the AcceptEx succeeds, and otherwise emit some nasty warnings
> and throw a flag that puts us into the Win9x listener code?

Testing AcceptEx is not easy; the failure only occurs when duplicating
the socket between processes. But maybe testing the Listen socket
provides us with enough information to indicate what the problem might be
and suggest or perform corrective action.

> Does accept() also fail?  Can we use the 9x code to work around these
> sorts of problems?

No, accept() is fine. Using the 9x path *may* work but I haven't
tested it. The other option Bill S. suggested was to add a directive
that forces the 9x path. I tend to think that is preferable to a
run time decision because I'm not sure we can reliably determine
which path to take at runtime.
Note: taking the 9x path is only relevant to case 2) above.

> I don't as much mind the Sleep(100) or even Sleep(0) so that we
> relinquish clock cycles.  It's the arbitrary "foil the server 100 times
> and it will exit" problem.

OK, so we can log a msg & continue instead of exiting.

Since we may not be able to guarantee a false positive
maybe we should modify the error message and say that
if NO requests are being served it is probably a firewall
or VPN problem, but continue the accept loop.
However, prior to logging this message we would need to test the Listen
socket and, if it is bad, log a message saying that the IP address has probably 
become invalid, then exit the child and let the parent renew the Listeners.

> Because those only occur once the listen socket becomes
> invalidated, due to DHCP or some other change.  You can trigger it
> by reconfiguring TCP/IP to switch between two IP addresses.
> Again, we can recover gracefully if we ask the parent to do 
> a respawn upon recreating all of *its* listeners.

i.e. whenever we hit some threshold of consecutive AcceptEx errors
test the Listening socket (tbd somehow), and exit the child if it is bad.
Allan
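
The policy summarized above can be sketched as a small state machine, independent of any socket API. Everything here is illustrative: the type and function names are hypothetical, the threshold of 1000 is an arbitrary placeholder, and the listener probe is passed in as a callback so that it only runs once the threshold is crossed.

```c
#include <assert.h>
#include <stddef.h>

#define ACCEPT_ERR_THRESHOLD 1000   /* arbitrary placeholder value */

typedef enum {
    ACCEPT_RETRY,           /* below threshold: reallocate accept socket, retry */
    ACCEPT_WARN_AND_RETRY,  /* threshold hit, listener OK: log and keep looping */
    ACCEPT_EXIT_CHILD       /* threshold hit, listener bad: let the parent renew */
} accept_action;

typedef struct {
    int err_count;          /* consecutive AcceptEx failures so far */
} accept_state;

/* Probe callback: returns nonzero if the listening socket still looks valid. */
typedef int (*listener_probe)(void);

/* Call after each AcceptEx failure of the "retryable" kind. */
static accept_action on_accept_error(accept_state *st, listener_probe probe)
{
    assert(st != NULL && probe != NULL);

    if (++st->err_count <= ACCEPT_ERR_THRESHOLD) {
        return ACCEPT_RETRY;
    }
    if (!probe()) {
        return ACCEPT_EXIT_CHILD;
    }
    st->err_count = 0;      /* start a fresh window after warning */
    return ACCEPT_WARN_AND_RETRY;
}

/* Call after any successful accept so only consecutive errors are counted. */
static void on_accept_success(accept_state *st)
{
    assert(st != NULL);
    st->err_count = 0;
}

/* Stand-in probes for experimentation; a real one would query the socket. */
static int probe_ok(void)  { return 1; }
static int probe_bad(void) { return 0; }
```

Because the child never exits while the listener probe passes, repeated early disconnects can at worst produce periodic warnings, which addresses the denial-of-service concern raised against the first patch.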



PR 15282 AcceptEx problem

2003-02-27 Thread Allan Edwards
As far as I can tell this is a bug in the Sprint
PCS Connect support for AcceptEx, (they install a
Winsock transport provider called BMI). However, it slips
through our checks and causes the accept thread to
hard loop and consume most of the cpu.
What happens is that in get_listeners_from_parent()
WSASocket *succeeds* using the WSAProtocolInfo from
the parent; however, AcceptEx in winnt_accept() fails
with WSAENOTSOCK.
I don't see what we can do to fix this but we should
at least avoid hogging the cpu and log an informative
message. Unless there is a better idea I'll commit to 2.1
16327 may be related but I haven't been able to recreate
the problem with BlackIce or Norton Personal Firewall.
Allan

Index: child.c
===
RCS file: /home/cvs/httpd-2.0/server/mpm/winnt/child.c,v
retrieving revision 1.12
diff -u -d -b -r1.12 child.c
--- child.c 26 Feb 2003 21:55:54 -  1.12
+++ child.c 27 Feb 2003 16:38:59 -
@@ -498,7 +498,7 @@
 PCOMP_CONTEXT context = NULL;
 DWORD BytesRead;
 SOCKET nlsd;
-int rv;
+int rv, err_count = 0;
 apr_os_sock_get(&nlsd, lr->sd);

@@ -547,6 +547,14 @@
 ap_log_error(APLOG_MARK, APLOG_DEBUG, rv, ap_server_conf,
 "winnt_accept: AcceptEx failed due to early client "
 "disconnect. Reallocate the accept socket and try again.");
+
+Sleep(100);
+if (++err_count > 100) {
+ap_log_error(APLOG_MARK,APLOG_ERR, rv, ap_server_conf,
+ "AcceptEx unrecoverable error, "
+ "possibly incompatible firewall or VPN software is installed.");
+break;
+}
 continue;
 }
 else if ((rv != APR_FROM_OS_ERROR(ERROR_IO_PENDING)) &&
@@ -558,6 +566,7 @@
 Sleep(100);
 continue;
 }
+err_count = 0;
 /* Wait for pending i/o.
  * Wake up once per second to check for shutdown .


Re: PR 15282 AcceptEx problem

2003-02-27 Thread Bill Stoddard
Humm... how do our friends at MS solve this in IIS?

Bill

Allan Edwards wrote:
> As far as I can tell this is a bug in the Sprint
> PCS Connect support for AcceptEx, (they install a
> Winsock transport provider called BMI). However, it slips
> through our checks and causes the accept thread to
> hard loop and consume most of the cpu.
> What happens is that in get_listeners_from_parent()
> WSASocket *succeeds* using the WSAProtocolInfo from
> the parent; however, AcceptEx in winnt_accept() fails
> with WSAENOTSOCK.
> I don't see what we can do to fix this but we should
> at least avoid hogging the cpu and log an informative
> message. Unless there is a better idea I'll commit to 2.1
> 16327 may be related but I haven't been able to recreate
> the problem with BlackIce or Norton Personal Firewall.
> Allan

Re: PR 15282 AcceptEx problem

2003-02-27 Thread Allan Edwards
Bill Stoddard wrote:
> Humm... how do our friends at MS solve this in IIS?

It only happens because of our parent-child process
model. If you run -X the problem goes away. It's the
socket duplication that seems to bite us.
Allan



Re: PR 15282 AcceptEx problem

2003-02-27 Thread Bill Stoddard
Allan Edwards wrote:
> Bill Stoddard wrote:
>
> > Humm... how do our friends at MS solve this in IIS?
>
> It only happens because of our parent-child process
> model. If you run -X the problem goes away. It's the
> socket duplication that seems to bite us.
> Allan

Perhaps we need a winnt mpm directive to force the server to use the 
Win9* accept code path. Would be a terrible thing to do on a production 
level server (for performance reasons) but quite okay for most of the 
folks that are seeing personal firewalls collide with our use of AcceptEx.

Bill
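
A rough sketch of how such a directive could steer the accept path. The enum and function names are hypothetical; only the idea (an on/off flag that forces the Win9x-style accept() path even on NT) comes from the message above, and the directive is the one named WindowsSocketsWorkaround in the later patch in this thread. The on/off parsing is a simplified stand-in for what the server's own flag-directive handling would do.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

typedef enum { PLATFORM_WIN9X, PLATFORM_WINNT } platform_kind;
typedef enum { ACCEPT_PATH_WIN9X, ACCEPT_PATH_ACCEPTEX } accept_path;

/* Parse an "on"/"off" directive argument, case-insensitively.
 * Returns 0 and stores 1 or 0 in *flag on success, -1 on anything else. */
static int parse_on_off(const char *arg, int *flag)
{
    assert(arg != NULL && flag != NULL);

    if (strcasecmp(arg, "on") == 0) {
        *flag = 1;
        return 0;
    }
    if (strcasecmp(arg, "off") == 0) {
        *flag = 0;
        return 0;
    }
    return -1;
}

/* Win9x always uses the plain accept() path; NT uses AcceptEx unless the
 * administrator forces the workaround. */
static accept_path choose_accept_path(platform_kind platform, int force_win9x)
{
    if (platform == PLATFORM_WIN9X || force_win9x) {
        return ACCEPT_PATH_WIN9X;
    }
    return ACCEPT_PATH_ACCEPTEX;
}
```

Making this a configuration-time choice rather than a runtime probe keeps the decision in the administrator's hands, at the cost of the performance penalty noted above for production servers.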



Re: PR 15282 AcceptEx problem

2003-02-27 Thread Allan Edwards
Bill Stoddard wrote:
> Perhaps we need a winnt mpm directive to force the server to use the 
> Win9* accept code path. Would be a terrible thing to do on a production 
> level server (for performance reasons) but quite okay for most of the 
> folks that are seeing personal firewalls collide with our use of AcceptEx.


mmm... that might work. PCS Connect has no problem with the accept() call.

Allan