Possible infinite loop in TThreadPoolServer
-------------------------------------------

                 Key: THRIFT-1493
                 URL: https://issues.apache.org/jira/browse/THRIFT-1493
             Project: Thrift
          Issue Type: Bug
          Components: Java - Library
    Affects Versions: 0.7
         Environment: Debian Squeeze
            Reporter: bert Passek


I just ran into a major problem with Thrift in combination with Flume, but the 
problem can actually be tracked down to the Thrift library itself.

I'm using Thrift in a typical client/server environment for tracking tons of 
data. We ran into an exception which basically looks like this:

2012-01-11 14:57:30,487 ERROR com.cloudera.flume.core.connector.DirectDriver: Exiting driver logicalNode newsletterImpressionLog01-21 in error state ThriftEventSource | CassandraSink because sleep interrupted
2012-01-11 17:18:14,808 WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
        at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
Caused by: java.net.SocketException: Too many open files
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
        at java.net.ServerSocket.accept(ServerSocket.java:430)
        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
        ... 2 more
2012-01-11 17:18:14,809 WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
        at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
Caused by: java.net.SocketException: Too many open files
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
        at java.net.ServerSocket.accept(ServerSocket.java:430)
        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
        ... 2 more

Note: Flume uses its own implementation of TThreadPoolServer, which is 
literally copied and pasted from the original Thrift source code. Flume embeds 
this part of the Thrift library in a massively multi-threaded environment.

I was running out of socket connections, as indicated by the "Too many open 
files" exception. This exception causes an infinite loop in this part of the 
serve() method:

while (!stopped_) {
  int failureCount = 0;
  try {
    TTransport client = serverTransport_.accept();
    WorkerProcess wp = new WorkerProcess(client);
    executorService_.execute(wp);
  } catch (TTransportException ttx) {
    if (!stopped_) {
      ++failureCount;
      LOGGER.warn("Transport error occurred during acceptance of message.", ttx);
    }
  }
}
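
Just to illustrate the direction I have in mind, here is a rough sketch of how 
this loop could bound consecutive accept failures and back off instead of 
spinning. This is not the actual Thrift code; the limit, the back-off delay and 
the decision to call stop() are only assumptions of mine:

final int maxConsecutiveFailures = 100;   // assumed limit, not from Thrift
final long acceptBackoffMs = 100;         // assumed back-off delay
int consecutiveFailures = 0;              // kept outside the loop, unlike failureCount above

while (!stopped_) {
  try {
    TTransport client = serverTransport_.accept();
    consecutiveFailures = 0;              // a successful accept resets the counter
    WorkerProcess wp = new WorkerProcess(client);
    executorService_.execute(wp);
  } catch (TTransportException ttx) {
    if (!stopped_) {
      ++consecutiveFailures;
      LOGGER.warn("Transport error occurred during acceptance of message.", ttx);
      if (consecutiveFailures >= maxConsecutiveFailures) {
        LOGGER.error("Too many consecutive accept failures, stopping the server.");
        stop();                           // breaks the loop instead of spinning forever
      } else {
        try {
          Thread.sleep(acceptBackoffMs);  // avoid a tight error loop that floods the log
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
      }
    }
  }
}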

Furthermore, during an overnight run I ran out of disk space because the logged 
exceptions increased the size of the log file dramatically. There was no way to 
recover.

If any critical exception occurs, the while loop is never stopped; it can only 
be stopped by calling the stop() method.

The question is how to handle such exceptions in general. I can't even catch 
the exception, because it is only logged and not handled in any way, so there 
is no way to react, for example to do some cleanup or to restart the server.
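
The only external handle I can see at the moment is to run serve() on its own 
thread and have some kind of watchdog call stop() from the outside; but even 
that is guesswork, because the library gives the watchdog nothing to observe. 
A rough sketch, assuming a final TThreadPoolServer variable named server and a 
purely hypothetical detectAcceptFailure() check:

final Thread serveThread = new Thread(new Runnable() {
  public void run() {
    server.serve();                      // blocks until stop() is called
  }
}, "thrift-serve");
serveThread.start();

Thread watchdog = new Thread(new Runnable() {
  public void run() {
    while (serveThread.isAlive()) {
      if (detectAcceptFailure()) {       // hypothetical check, e.g. watching the log or open fds
        server.stop();                   // the only way to break out of serve()
        break;
      }
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        break;
      }
    }
  }
}, "thrift-watchdog");
watchdog.start();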

Best Regards 

Bert Passek
