We believe we've made some progress in identifying the problem. It appears we have a slow socket-connection leak on the Collector node, due to sparse data coming in on some Thrift RPC sources. It turns out we're going through a firewall, and we believe it is killing those inactive connections.
The Agent node's Thrift RPC sink sockets are getting cleaned up after a socket timeout on a subsequent append, but the Collector still has its socket connections open, and they never appear to time out and close. I found the following, which seems to describe the problem:

http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3c1311642202.14311.2155844...@webmail.messagingengine.com%3E

However, because other disconnect conditions could presumably trigger the problem as well, we are still looking for a solution that doesn't require fiddling with firewall settings. Is there a way to configure the Collector node to drop/close these inactive connections, either at the Linux network layer or through the Java socket APIs within Flume? (A rough sketch of what I have in mind follows the quoted message below.)

Thanks,

Frank Grimes

On 2012-01-26, at 10:51 AM, Frank Grimes wrote:

> Hi All,
>
> We are using flume-0.9.5 (specifically,
> http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and
> occasionally our Collector node accumulates too many open TCP connections
> and starts madly logging the following errors:
>
> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error
> occurred during acceptance of message.
> org.apache.thrift.transport.TTransportException: java.net.SocketException:
> Too many open files
>         at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>         at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>         at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
> Caused by: java.net.SocketException: Too many open files
>         at java.net.PlainSocketImpl.socketAccept(Native Method)
>         at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>         at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>         at java.net.ServerSocket.accept(ServerSocket.java:430)
>         at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>         ... 2 more
>
> This quickly fills up the disk as the log file grows to multiple gigabytes
> in size.
>
> After some investigation, it appears that even though the Agent nodes show
> single open connections to the Collector, the Collector node appears to
> have a bunch of zombie TCP connections open back to the Agent nodes.
> i.e. "lsof -n | grep PORT" on the Agent node shows 1 established
> connection. However, the Collector node shows hundreds of established
> connections on that same port which don't seem to correspond to any
> connections I can find on the Agent node.
>
> So we're concluding that the Collector node is somehow leaking connections.
>
> Has anyone seen this kind of thing before?
>
> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
> Or could this be a Thrift bug that could be avoided by switching to Avro
> sources/sinks?
>
> Any hints/tips are most welcome.
>
> Thanks,
>
> Frank Grimes
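P.S. To make the Java-socket-API half of the question concrete, below is the kind of thing I was imagining. It is only an untested sketch: setKeepAlive() and setSoTimeout() are standard java.net.Socket calls, but the wrapper class, the 10-minute timeout, and the idea of hooking this into the Collector's accept path (somewhere around TSaneServerSocket.acceptImpl, presumably) are all just my assumptions, not existing Flume code.

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Untested sketch: configure every connection the Collector accepts so
    // that idle or dead peers eventually get cleaned up instead of lingering
    // as zombie connections holding file descriptors open.
    public class KeepAliveAccept {

        public static Socket acceptWithKeepAlive(ServerSocket server) throws IOException {
            Socket socket = server.accept();

            // Enable SO_KEEPALIVE so the OS probes idle connections and
            // notices peers the firewall has silently dropped. On Linux the
            // probe timing comes from the net.ipv4.tcp_keepalive_* kernel
            // settings (first probe after ~2 hours by default).
            socket.setKeepAlive(true);

            // Make blocking reads throw SocketTimeoutException after 10
            // minutes of silence, so the handler thread gets a chance to
            // close quiet connections instead of blocking on them forever.
            socket.setSoTimeout(10 * 60 * 1000);

            return socket;
        }
    }

If SO_KEEPALIVE alone turns out to be enough, the Linux-network-layer alternative would presumably just be lowering net.ipv4.tcp_keepalive_time via sysctl on the Collector box, with no Flume code changes at all.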