I debugged around my occasional failures for TestCoyoteAdapter when
using APR.
Error is:
SEVERE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Poller failed with
error [81] : [File descriptor in bad state]
In the native world it is:
19372/128: port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8)
Err#81 EBADFD
The bad FD here is 0xFFFFFFFF. Normally the FDs put into this slot of
port_associate during the test are small numbers between 6 and 20
(decimal). Here's a -1 and we must have an error before. Wat we would
have expected is FD 11. In the truss snippet, Thread 2 is the client
opening the connection, thread 130 accepts it and thread 133 starts to
work on it:
19372/2: so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "",
SOV_DEFAULT) = 8
Connect
19372/2: connect(8, 0xFDF7DAE0, 16, SOV_DEFAULT) = 0
19372/2: AF_INET name = 127.0.0.1 port = 42932
Accept
19372/130: accept(5, 0xB397F608, 0xB397F5F4, SOV_DEFAULT) = 11
19372/130: AF_INET name = 127.0.0.1 port = 42936
19372/2: getsockname(8, 0xFDF7DAE0, 0xFDF7DA2C, SOV_DEFAULT) = 0
19372/2: AF_INET name = 127.0.0.1 port = 42936
19372/130: fcntl(11, F_GETFD, 0x00000003) = 0
19372/130: fcntl(11, F_SETFD, 0x00000001) = 0
19372/2: setsockopt(8, tcp, TCP_NODELAY, 0xFDF7DE10, 4,
SOV_DEFAULT) = 0
19372/130: lwp_cond_signal(0x0031E3B0) = 0
19372/130: condvar type: USYNC_THREAD
Send request
19372/2: send(8, " G E T / f o o ; j s e".., 227, 0) = 227
19372/13: lwp_cond_wait(0x0031E3B0, 0x0031E398, 0xB457F278, 0) = 0
19372/13: condvar type: USYNC_THREAD
19372/13: mutex type: USYNC_THREAD
19372/13: timeout: 0.999998773 sec
19372/130: lwp_cond_signal(0x004215F0) = 0
19372/130: condvar type: USYNC_THREAD
19372/133: lwp_cond_wait(0x004215F0, 0x004215D8, 0x00000000, 0) = 0
19372/133: condvar type: USYNC_THREAD
19372/133: mutex type: USYNC_THREAD
19372/133: setsockopt(11, tcp, TCP_NODELAY, 0xB3C7F40C, 4,
SOV_DEFAULT) = 0
19372/133: fcntl(11, F_GETFL) = 2
FWRITE
19372/133: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0
19372/133: lwp_cond_signal(0x0031E3B0) = 0
19372/133: condvar type: USYNC_THREAD
19372/133: lwp_cond_signal(0x00274448) = 0
19372/133: condvar type: USYNC_THREAD
19372/128: lwp_cond_wait(0x00274448, 0x00274430, 0xB417F580, 0) = 0
19372/128: condvar type: USYNC_THREAD
19372/128: mutex type: USYNC_THREAD
19372/128: timeout: 9.999997356 sec
19372/128: lwp_cond_signal(0x0031E3B0) = 0
19372/128: condvar type: USYNC_THREAD
19372/128: lwp_cond_signal(0x0031E3B0) = 0
19372/128: condvar type: USYNC_THREAD
Adding to the pollset with wrong FD -1 instead of 11.
19372/128: port_associate(6, 4, 0xFFFFFFFF, 0x00000001, 0x0089ABE8)
Err#81 EBADFD
I added some Java log statements to track calls to the Poll class
(please ignore that hey are log level warn):
22-Sep-2013 12:19:12.909 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784]
22-Sep-2013 12:19:12.910 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size 0 socket 2218784
22-Sep-2013 12:19:12.911 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0
Here begins the root cause:
- start polling poller of size 1, containing only socket 2218784:
22-Sep-2013 12:19:12.912 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 of size 1 with timeout 2000
- start removing from the same poller concurrently in the exec thread:
22-Sep-2013 12:19:12.914 WARNING [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing
from poller number 0 of size 1 socket 2856392
- polling returns with one fd and decreases size to 0:
22-Sep-2013 12:19:12.913 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 returned with 1
- remove returns with success and decreases size to -1! :
22-Sep-2013 12:19:12.915 WARNING [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.removeFromPoller Removing
from poller number 0 returned with 0
22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Processing socket
[2,218,784] for event(s) [1]
22-Sep-2013 12:19:12.916 FINE [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,856,392],
doIt [true]
22-Sep-2013 12:19:12.929 FINE [http-apr-127.0.0.1-auto-13-exec-10]
org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,218,784],
timeout [3,000], flags [1]
22-Sep-2013 12:19:12.930 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,218,784]
Here the poller data structure is corrupt: it says size is -1!
The value printed is "actualPollerSize-pollerSpace[i]", so
pollerSpace[i] was incremented once to many.
22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size -1 socket 2218784
22-Sep-2013 12:19:12.931 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0
22-Sep-2013 12:19:22.935 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.maintain Timing out [2,218,784
22-Sep-2013 12:19:22.937 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint.destroySocket socket [2,218,784],
doIt [true]
22-Sep-2013 12:19:22.940 FINE [http-apr-127.0.0.1-auto-13-Acceptor-0]
org.apache.tomcat.util.net.AprEndpoint.processSocketWithOptions socket
[2,848,192]
22-Sep-2013 12:19:22.947 FINE [http-apr-127.0.0.1-auto-13-exec-2]
org.apache.tomcat.util.net.AprEndpoint$Poller.add socket [2,848,192],
timeout [3,000], flags [1]
22-Sep-2013 12:19:22.948 FINE [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run socket [2,848,192]
22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 of size 0 socket 2848192
22-Sep-2013 12:19:22.949 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.addToPoller Adding to
poller number 0 returned with 0
22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 of size 1 with timeout 2000
Here we get EBADFD (-81 on Solaris):
22-Sep-2013 12:19:22.950 WARNING [http-apr-127.0.0.1-auto-13-Poller]
org.apache.tomcat.util.net.AprEndpoint$Poller.run Polling poller number
0 returned with -81
Not sure whether the problem is more in the concurrent poll plus remove,
or the following code in poll.c:
TCN_IMPLEMENT_CALL(jint, Poll, remove)(TCN_STDARGS, jlong pollset,
jlong socket)
{
apr_pollfd_t fd;
apr_status_t rv;
tcn_pollset_t *p = J2P(pollset, tcn_pollset_t *);
tcn_socket_t *s = J2P(socket, tcn_socket_t *);
UNREFERENCED_STDARGS;
TCN_ASSERT(socket != 0);
if (s->pe == NULL) {
/* Already removed */
return APR_SUCCESS;
}
Here we return APR_SUCCESS and the code calling Poll.remove in
AprEndpoint always does:
rv = Poll.remove(pollers[i], socket);
if (rv != Status.APR_NOTFOUND) {
pollerSpace[i]++;
connectionCount--;
break;
}
So the pollerSpace and connectionCount numbers are (in/de)cremented.
I don't have a real solution yet, because I'm not sure which of the two
observations is the real culprit, concurrency or "wrong" return code.
Regards,
Rainer
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]