Hi Reka, Akila,

Thanks for the response.
On the system where we encountered the issue we are still running the Java cartridge agent (JCA). I checked the commit as suggested by Akila and we do have it. The secureEvictionTimePeriod is set to the default (5500 ms). A few questions:

* STRATOS-739 states (see the quote below) that the secureEvictionTimePeriod needs to be tuned according to the publishing period. How would I determine the publishing period?
* I don't see the config file (thrift-agent-config.xml) in the JCA package (client side). Where should we set the time period, server side or client side?
* Is this configuration (or something similar) expected in PCA as well? (We are migrating over, but it's taking some time.)

Thanks
Martin

Quote from the doc: "... If you are publishing events in a periodic interval as more than 5.5s, you need to tune the <secureEvictionTimePeriod> parameter accordingly ..."

From: Reka Thirunavukkarasu [mailto:r...@wso2.com]
Sent: Sunday, September 13, 2015 10:49 PM
To: dev
Cc: Chamila De Alwis
Subject: Re: Stratos 4.1: "Too many open files" issue

Hi Martin,

What cartridge agent are you using currently? Is it Java or Python? This problem was identified on the Thrift data publisher side in Java. Since the Python agent uses a different approach to connect to the data receiver, we will need to verify whether the Python agent has this particular issue fixed. If you could explain which clients are connecting to Stratos using Thrift, then we can check the Thrift agent version as Akila mentioned and find out the root cause of this issue.

Thanks,
Reka

On Sun, Sep 13, 2015 at 12:46 PM, Akila Ravihansa Perera <raviha...@wso2.com> wrote:

Hi Martin,

I think we fixed this problem by uplifting the Thrift agent feature in [1]. The root cause of the issue was that after periodic publishing, the Thrift agent fails to evict the old connection, as reported in [2]. The fix is described in [3]. Your stack trace looks very similar to what has been reported in the JIRA.
Can you check whether you have this fix applied?

[1] https://github.com/apache/stratos/commit/8985d96eb811aa8e9ce2c114f1856b4c4e20517b
[2] https://issues.apache.org/jira/browse/STRATOS-723
[3] https://issues.apache.org/jira/browse/STRATOS-739

Thanks.

On Sun, Sep 13, 2015 at 10:56 AM, Imesh Gunaratne <im...@apache.org> wrote:

Hi Martin,

I believe you are using 4.1.0-RC4 with some custom changes you have made locally. Will you be able to test this on the latest commit of the stratos-4.1.x branch (without any other changes)? I cannot recall a fix we did for this after 4.1.0-RC4, but it would be better if you can verify with the latest code in the stratos-4.1.x branch.

At the same time, will you be able to do the following:

* Take a thread dump of the running Stratos and CEP instances once this happens
* Check the file descriptor limits of the OS

Thanks

On Sat, Sep 12, 2015 at 10:56 PM, Martin Eppel (meppel) <mep...@cisco.com> wrote:

Resending in case it got lost.

Thanks
Martin

From: Martin Eppel (meppel)
Sent: Thursday, September 10, 2015 2:39 PM
To: dev@stratos.apache.org
Subject: Stratos 4.1: "Too many open files" issue

Hi,

We are seeing an issue with Stratos running out of file handles when creating a number of applications and VM instances. The scenario is as follows: 13 applications are deployed, each with a single cluster and a single member instance. As the VMs spin up, Stratos becomes unresponsive, and checking the logs we find the exceptions below. I remember we saw similar issues (same exceptions) back in Stratos 4.0 in the context of longevity tests.

We are running Stratos 4.1 RC4 with the latest commit:

commit 0fd41840fb04d92ba921bf58c59c2c3fbad0c561
Author: Imesh Gunaratne <im...@apache.org>
Date: Tue Jul 7 12:54:47 2015 +0530

Is this a known issue which might have been fixed in a later commit, or something new?
Can we verify that the fixes for the previous issues are included in our system (jars, commits, etc.)?

org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:118)
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
    at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
    at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:106)
    at org.wso2.carbon.databridge.receiver.thrift.internal.ThriftDataReceiver$ServerThread.run(ThriftDataReceiver.java:199)
    at java.lang.Thread.run(Thread.java:745)

TID: [0] [STRATOS] [2015-08-17 17:38:17,499] WARN {org.apache.thrift.server.TThreadPoolServer} - Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:118)
    at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35)
    at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
    at org.apache.thrift.server.TThreadPoolServer.serve(TThreadPoolServer.java:106)
    at org.wso2.carbon.databridge.receiver.thrift.internal.ThriftDataReceiver$ServerThread.run(ThriftDataReceiver.java:199)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Too many open files

// Listing the applications, member instances, and cartridge state ([di-000-xxx] = application name):

di-000-010: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)
di-000-011: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Initialized 1)
cartridge-proxy: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Active 1)
di-000-001: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Active 1)
di-000-002: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Active 1)
di-000-012: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Created 1)
di-000-003: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)
di-000-004: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)
di-000-006: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)
di-000-005: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)
di-000-008: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)
di-000-007: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)
di-000-009: applicationInstances 1, groupInstances 0, clusterInstances 1, members 1 (Starting 1)

--
Imesh Gunaratne
Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos

--
Akila Ravihansa Perera
WSO2 Inc.; http://wso2.com/
Blog: http://ravihansa3000.blogspot.com

--
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.: http://wso2.com
Mobile: +94776442007
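[Editor's note] For anyone diagnosing a similar "Too many open files" condition later: the two checks Imesh suggests in the thread (thread dump, file descriptor limits) can be gathered with standard tools. The sketch below assumes a Linux host with /proc mounted and a JDK on the PATH; the PID lookup via pgrep is an illustrative assumption — substitute however you normally locate the Carbon/Stratos JVM. It is not part of any Stratos tooling.

```shell
# Locate the Stratos (Carbon) JVM. The pgrep pattern is a guess; fall back to
# the current shell's PID ($$) so the snippet also runs standalone.
pid=$(pgrep -f carbon | head -n 1)
pid=${pid:-$$}

# File-descriptor limits in effect for new processes (soft and hard).
echo "soft nofile limit: $(ulimit -Sn)"
echo "hard nofile limit: $(ulimit -Hn)"

# Count the descriptors the process currently holds open. A count that grows
# steadily under constant load points at a connection leak like the one
# reported in STRATOS-723 / fixed in STRATOS-739.
fd_count=$(ls "/proc/$pid/fd" | wc -l)
echo "PID $pid holds $fd_count open file descriptors"

# Thread dump of the running JVM (jstack ships with the JDK), uncomment to use:
#   jstack "$pid" > "stratos-threads-$(date +%s).txt"
```

Comparing fd_count against the soft limit over time shows how close the receiver is to exhausting its descriptors before the TTransportException starts appearing.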