We believe that we've made some progress in identifying the problem.

It appears that we have a slow socket connection leak on the Collector node, 
caused by sparse data coming in on some Thrift RPC sources.
It turns out the traffic goes through a firewall, which we believe is killing 
those idle connections.

The Agent node's Thrift RPC sink sockets are getting cleaned up after a socket 
timeout on a subsequent append, but the Collector still has its socket 
connections open and they don't appear to ever be timing out and closing.
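On the Java side, one plausible mitigation (a sketch of my own, not something Flume 0.9.5 does out of the box) would be for the Collector to enable TCP keepalive and a read timeout on each accepted socket, so the OS eventually detects dead peers and blocked reads give up. The class and helper names here are hypothetical, just to illustrate the standard `java.net.Socket` calls involved:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class CollectorSocketSketch {
    // Hypothetical helper: harden an accepted server-side socket so that
    // silently-dropped peers are eventually detected and reads can't block forever.
    static void harden(Socket s, int idleMillis) throws IOException {
        s.setKeepAlive(true);       // ask the OS to send TCP keepalive probes on idle connections
        s.setSoTimeout(idleMillis); // blocking reads throw SocketTimeoutException after idleMillis
    }

    public static void main(String[] args) throws IOException {
        // Loopback demo: bind an ephemeral port, connect, and accept.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket accepted = server.accept()) {
            harden(accepted, 60_000);
            System.out.println("keepAlive=" + accepted.getKeepAlive()); // prints keepAlive=true
            System.out.println("soTimeout=" + accepted.getSoTimeout()); // prints soTimeout=60000
        }
    }
}
```

For this to close the leak, something in the Collector's read loop would also need to actually close the socket when the timeout or keepalive failure surfaces as an exception.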

I found the following, which seems to describe the problem:

http://mail-archives.apache.org/mod_mbox/incubator-flume-user/201107.mbox/%3c1311642202.14311.2155844...@webmail.messagingengine.com%3E

However, because other disconnect conditions could presumably trigger the same 
problem, we are still looking for a solution that doesn't require fiddling with 
firewall settings.

Is there a way to configure the Collector node to drop/close these inactive 
connections, either at the Linux network layer or through the Java socket APIs 
within Flume?
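At the Linux network layer, one option I've been considering (an assumption on my part; it only helps if the Collector's server sockets have SO_KEEPALIVE enabled) would be to tighten the kernel's TCP keepalive tunables, which default to waiting roughly two hours before the first probe. Example values only, e.g. in /etc/sysctl.conf:

```
# Hypothetical sysctl fragment: start probing connections idle for 10 minutes,
# probe every 60s, give up after 5 failed probes, so dead peers are torn down
# in ~15 minutes instead of the ~2-hour Linux default.
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 5
```

These would be applied with `sysctl -p`, but I'd welcome confirmation on whether Flume's Thrift sockets enable keepalive at all.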

Thanks,

Frank Grimes


On 2012-01-26, at 10:51 AM, Frank Grimes wrote:

> Hi All,
> 
> We are using flume-0.9.5 (specifically, 
> http://svn.apache.org/repos/asf/incubator/flume/trunk@1179275) and 
> occasionally our Collector node accumulates too many open TCP connections and 
> starts madly logging the following errors:
> 
> WARN org.apache.thrift.server.TSaneThreadPoolServer: Transport error occurred 
> during acceptance of message.
> org.apache.thrift.transport.TTransportException: java.net.SocketException: 
> Too many open files
>        at 
> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:139)
>        at 
> org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
>        at 
> org.apache.thrift.server.TSaneThreadPoolServer$1.run(TSaneThreadPoolServer.java:175)
> Caused by: java.net.SocketException: Too many open files
>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:408)
>        at java.net.ServerSocket.implAccept(ServerSocket.java:462)
>        at java.net.ServerSocket.accept(ServerSocket.java:430)
>        at 
> org.apache.thrift.transport.TSaneServerSocket.acceptImpl(TSaneServerSocket.java:134)
>        ... 2 more
> 
> This quickly fills up the disk as the log file grows to multiple gigabytes in 
> size.
> 
> After some investigation, it appears that even though the Agent nodes show 
> single open connections to the Collector, the Collector node appears to have 
> a bunch of zombie TCP connections open back to the Agent nodes.
> i.e.
> "lsof -n | grep PORT" on the Agent node shows 1 established connection
> However, the Collector node shows hundreds of established connections for 
> that same port which don't seem to match up with any connections I can find 
> on the Agent node.
> 
> So we're concluding that the Collector node is somehow leaking connections.
> 
> Has anyone seen this kind of thing before?
> 
> Could this be related to https://issues.apache.org/jira/browse/FLUME-857?
> Or could this be a Thrift bug that could be avoided by switching to Avro 
> sources/sinks?
> 
> Any hints/tips are most welcome.
> 
> Thanks,
> 
> Frank Grimes
