https://issues.apache.org/jira/browse/FLUME-943


On 2012-01-30, at 10:49 AM, Frank Grimes wrote:

> I think this bug might be addressed by making use of TCP keepalive on the 
> Thrift server socket. e.g.
> 
> Index: flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java
> ===================================================================
> --- flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java  (revision 1237721)
> +++ flume-core/src/main/java/org/apache/thrift/transport/TSaneServerSocket.java  (working copy)
> @@ -132,6 +132,7 @@
>      }
>      try {
>        Socket result = serverSocket_.accept();
> +      result.setKeepAlive(true); 
>        TSocket result2 = new TBufferedSocket(result);
>        result2.setTimeout(clientTimeout_);
>        return result2;
> 
> I believe that on Linux this would cause dead connections to be detected and 
> cleaned up after roughly 2 hours of idle time by default 
> (net.ipv4.tcp_keepalive_time is 7200 seconds, plus a few minutes of probing).
> This is likely good enough to prevent the "java.net.SocketException: Too many 
> open files" from occurring in our case.
> Note that it's also configurable as per 
> http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#usingkeepalive.
> 
> Shall I open up a JIRA case for this and submit a patch?
> Should the keepalive be configurable or is it desirable to always have the 
> Flume collector protected from these kinds of killed connections?
> I can't think of any downsides to always having it on...
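> 
> For illustration, here's a minimal sketch of what a configurable version of 
> the accept path inside TSaneServerSocket might look like. The 
> flume.thrift.socket.keepalive property name is purely hypothetical, not an 
> existing Flume option:
> 
>     protected TSocket acceptImpl() throws TTransportException {
>       try {
>         Socket result = serverSocket_.accept();
>         // Hypothetical switch; defaults to enabling SO_KEEPALIVE so the OS
>         // probes idle peers and eventually closes dead connections.
>         if (Boolean.parseBoolean(
>             System.getProperty("flume.thrift.socket.keepalive", "true"))) {
>           result.setKeepAlive(true);
>         }
>         TSocket result2 = new TBufferedSocket(result);
>         result2.setTimeout(clientTimeout_);
>         return result2;
>       } catch (IOException iox) {
>         throw new TTransportException(iox);
>       }
>     }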
> 
> Cheers,
> 
> Frank Grimes
> 
> 
> On 2012-01-28, at 12:20 PM, Frank Grimes wrote:
> 
>> We believe that we've made some progress in identifying the problem.
>> 
>> It appears that we have a slow socket connection leak on the Collector node 
>> due to sparse data coming in on some Thrift RPC sources.
>> It turns out we're going through a firewall, and we believe it is killing 
>> those inactive connections.
>> 
>> The Agent node's Thrift RPC sink sockets are getting cleaned up after a 
>> socket timeout on a subsequent append, but the Collector still has its 
>> socket connections open and they don't appear to ever be timing out and 
>> closing.
>> 
>> I found the following which seems to describe the problem:
>> 
>>   
>> http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3c1311642202.14311.2155844...@webmail.messagingengine.com%3E
>> 
>> However, since other disconnect conditions could presumably trigger the 
>> problem as well, we are still looking for a solution that doesn't require 
>> fiddling with firewall settings.
>> 
>> Is there a way to configure the Collector node to drop/close these inactive 
>> connections? 
>> i.e. either at the Linux network layer or through Java socket APIs within 
>> Flume?
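>> 
>> If it helps, here's a rough sketch of the plain Java socket knobs that seem 
>> relevant on the collector side. This is not actual Flume code, and the 
>> 10-minute value is just an arbitrary example:
>> 
>>     static void configureAcceptedSocket(java.net.Socket s)
>>         throws java.net.SocketException {
>>       // Ask the kernel to probe idle peers; on Linux the first probe goes
>>       // out after net.ipv4.tcp_keepalive_time seconds (7200 by default).
>>       s.setKeepAlive(true);
>>       // Make blocking reads throw SocketTimeoutException after 10 minutes
>>       // of silence, so a handler thread can close the socket instead of
>>       // waiting forever on a connection the firewall has already dropped.
>>       s.setSoTimeout(10 * 60 * 1000);
>>     }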
>> 
>> Thanks,
>> 
>> Frank Grimes
>> 
>> 
>> On 2012-01-26, at 10:51 AM, Frank Grimes wrote:
>> 
>>> Hi All,
>>> 
>>> We are using flume-0.9.5 (specifically, 
>>> http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and 
>>> occasionally our Collector node accumulates too many open TCP connections 
>>> and starts madly logging the following errors:
>>> 
>>> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred during acceptance of message.
>>> org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
>>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>>>        at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>>>        at org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
>>> Caused by: java.net.SocketException: Too many open files
>>>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>>>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>>>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>>>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>>>        at org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>>>        ... 2 more
>>> 
>>> This quickly fills up the disk as the log file grows to multiple gigabytes 
>>> in size.
>>> 
>>> After some investigation, it appears that even though each Agent node shows 
>>> a single open connection to the Collector, the Collector node is holding a 
>>> bunch of zombie TCP connections from the Agent nodes.
>>> i.e.
>>> "lsof -n | grep PORT" on the Agent node shows 1 established connection.
>>> However, the Collector node shows hundreds of established connections on 
>>> that same port which don't seem to correspond to any connections I can find 
>>> on the Agent node.
>>> 
>>> So we're concluding that the Collector node is somehow leaking connections.
>>> 
>>> Has anyone seen this kind of thing before?
>>> 
>>> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
>>> Or could this be a Thrift bug that could be avoided by switching to Avro 
>>> sources/sinks?
>>> 
>>> Any hints/tips are most welcome.
>>> 
>>> Thanks,
>>> 
>>> Frank Grimes
>> 
> 
