https://issues.apache.org/jira/browse/FLUME-943
On 2012-01-30, at 10:49 AM, Frank Grimes wrote:

> I think this bug might be addressed by making use of TCP keepalive on the Thrift server socket, e.g.
>
> Index: flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java
> ===================================================================
> --- flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java (revision 1237721)
> +++ flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java (working copy)
> @@ -132,6 +132,7 @@
>      }
>      try {
>        Socket result = serverSocket_.accept();
> +      result.setKeepAlive(true);
>        TSocket result2 = new TBufferedSocket(result);
>        result2.setTimeout(clientTimeout_);
>        return result2;
>
> I believe that on Linux that would force the connections to be closed/cleaned up after 2 hours by default.
> This is likely good enough to prevent the "java.net.SocketException: Too many open files" from occurring in our case.
> Note that it's also configurable as per
> http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive.
>
> Shall I open up a JIRA case for this and submit a patch?
> Should the keepalive be configurable, or is it desirable to always have the Flume collector protected from these kinds of killed connections?
> I can't think of any downsides to always having it on...
>
> Cheers,
>
> Frank Grimes
>
>
> On 2012-01-28, at 12:20 PM, Frank Grimes wrote:
>
>> We believe that we've made some progress in identifying the problem.
>>
>> It appears that we have a slow socket connection leak on the Collector node due to sparse data coming in on some Thrift RPC sources.
>> Turns out we're going through a firewall, and we believe that it is killing those inactive connections.
>>
>> The Agent node's Thrift RPC sink sockets are getting cleaned up after a socket timeout on a subsequent append, but the Collector still has its socket connections open and they don't appear to ever be timing out and closing.
>>
>> I found the following, which seems to describe the problem:
>>
>> http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3c1311642202.14311.2155844...@webmail.messagingengine.com%3E
>>
>> However, because presumably some other disconnect conditions could trigger the problem as well, we are still looking for a solution that doesn't require fiddling with firewall settings.
>>
>> Is there a way to configure the Collector node to drop/close these inactive connections?
>> i.e. either at the Linux network layer or through Java socket APIs within Flume?
>>
>> Thanks,
>>
>> Frank Grimes
>>
>>
>> On 2012-01-26, at 10:51 AM, Frank Grimes wrote:
>>
>>> Hi All,
>>>
>>> We are using flume-0.9.5 (specifically, http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and occasionally our Collector node accumulates too many open TCP connections and starts madly logging the following errors:
>>>
>>> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
>>> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>>>     at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>>>     at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>>>     at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
>>> Caused by: java.net.SocketException: Too many open files
>>>     at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>     at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>>>     at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>>>     at java.net.ServerSocket.accept(ServerSocket.java:430)
>>>     at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>>>     ... 2 more
>>>
>>> This quickly fills up the disk as the log file grows to multiple gigabytes in size.
>>>
>>> After some investigation, it appears that even though the Agent nodes show single open connections to the Collector, the Collector node appears to have a bunch of zombie TCP connections open back to the Agent nodes.
>>> i.e. "lsof -n | grep PORT" on the Agent node shows 1 established connection.
>>> However, the Collector node shows hundreds of established connections for that same port which don't seem to tie up to any connections I can find on the Agent node.
>>>
>>> So we're concluding that the Collector node is somehow leaking connections.
>>>
>>> Has anyone seen this kind of thing before?
>>>
>>> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
>>> Or could this be a Thrift bug that could be avoided by switching to Avro sources/sinks?
>>>
>>> Any hints/tips are most welcome.
>>>
>>> Thanks,
>>>
>>> Frank Grimes
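For illustration, here is a minimal standalone sketch of the keepalive idea from the patch above, written against plain java.net sockets rather than the actual Flume/Thrift transport classes. The class name KeepAliveAcceptSketch, the keepAlive constructor flag, and its wiring to a configuration option are assumptions made for this sketch and are not part of the real TSaneServerSocket code.

    // Sketch only: NOT the actual Flume/Thrift TSaneServerSocket implementation.
    // Shows how the proposed result.setKeepAlive(true) call could be made configurable.
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class KeepAliveAcceptSketch {

        private final ServerSocket serverSocket;
        private final boolean keepAlive;     // hypothetical config flag (could default to true)
        private final int clientTimeoutMs;   // read timeout applied to accepted sockets

        public KeepAliveAcceptSketch(ServerSocket serverSocket,
                                     boolean keepAlive,
                                     int clientTimeoutMs) {
            this.serverSocket = serverSocket;
            this.keepAlive = keepAlive;
            this.clientTimeoutMs = clientTimeoutMs;
        }

        public Socket accept() throws IOException {
            Socket result = serverSocket.accept();
            if (keepAlive) {
                // Ask the OS to send TCP keepalive probes on this connection. Once the
                // probes go unanswered, a blocked read on the connection fails, so the
                // server can close the socket and release its file descriptor instead of
                // holding a zombie connection open indefinitely.
                result.setKeepAlive(true);
            }
            // Mirrors result2.setTimeout(clientTimeout_) in the patch: bound blocking reads.
            result.setSoTimeout(clientTimeoutMs);
            return result;
        }
    }

On Linux the kernel defaults (net.ipv4.tcp_keepalive_time = 7200 s, net.ipv4.tcp_keepalive_intvl = 75 s, net.ipv4.tcp_keepalive_probes = 9) mean a silently dropped connection is detected roughly two hours after it goes idle, matching the two-hour figure mentioned in the thread; those sysctls can be lowered if faster cleanup of collector file descriptors is needed.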