So to the original question, is it your opinion that a single management server (non-clustered) should also fence itself, or wait for the database connection to be restored?
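
To make the "wait" option concrete, here is a minimal sketch of a bounded wait-and-retry around a database call, along the lines of the db.cloud.connection.retry.count idea quoted further down. Everything in it is an assumption for illustration: `DbRetrySketch`, `withRetry`, and the parameters are hypothetical names, not existing CloudStack code. It treats SQLSTATE class "08" (the standard connection-exception class) as a lost link, and it does not address what happens to orchestration work that was in flight when the link dropped.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

/** Hypothetical helper, not part of CloudStack: retries a DB call when the link drops. */
final class DbRetrySketch {

    /** SQLSTATE class "08" is the standard class for connection exceptions. */
    static boolean isConnectionFailure(SQLException e) {
        String state = e.getSQLState();
        return state != null && state.startsWith("08");
    }

    /**
     * Runs the call, retrying up to retryCount extra times on connection failures.
     * Any other exception is rethrown immediately. If every attempt fails, the
     * last connection failure is rethrown so the caller can still decide to fence.
     */
    static <T> T withRetry(Callable<T> call, int retryCount, long delayMillis) throws Exception {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        SQLException last = null;
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            try {
                return call.call();
            } catch (SQLException e) {
                if (!isConnectionFailure(e)) {
                    throw e; // not a connectivity problem, do not mask it
                }
                last = e;
                if (attempt < retryCount) {
                    Thread.sleep(delayMillis); // give the database time to come back
                }
            }
        }
        throw last;
    }
}
```

A single, non-clustered server could wrap its heartbeat query in something like `withRetry(query, 5, 2000L)` and only shut down once the retries are exhausted.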

On Mon, Jul 8, 2013 at 2:03 PM, Kelven Yang <kelven.y...@citrix.com> wrote:
>
> On 7/7/13 3:36 PM, "Marcus Sorensen" <shadow...@gmail.com> wrote:
>
>>I think there are two separate issues here.
>>
>>1) The management server uses the database to determine cluster
>>membership, and if no database connection can be made, the management
>>server fences itself (shuts down). This is good, but in the case where
>>there's only one management server (no cluster intended), it seems
>>like an issue. However, it may be better to shut down; I'm not sure
>>how the management server will react after a temporary database
>>outage. Some opinions would be appreciated. My preference would be
>>that a single management server would just be able to pick back up
>>where it left off rather than dying.
>
> In a management server cluster setup with multiple management servers, to
> avoid a split-brain situation we actively perform management server
> self-fence as soon as an individual management server detects an
> inconsistent view of the cluster.
>
> As the clustering logic relies heavily on the DB, loss of DB connectivity
> is considered a fatal event that triggers self-fence, in addition to
> inconsistent view detection. A multi-master DB setup only works if the
> switch of database instance is transparent to CloudStack. That means
> automatic database fail-over should be handled completely at the DB
> connectivity layer, and CloudStack should not be aware of it. Most of the
> current CloudStack logic is built on this assumption. It may be possible
> to relax this requirement, but we need to investigate the impact and test
> how resilient CloudStack would be to unexpected DB connectivity
> exceptions in the middle of various orchestration workflows.
>
>>
>>2) There is no support for JDBC's built-in loadbalancing features. I
>>have a patch that fixes this, however I noticed a few things that I'd
>>like some feedback on. Namely, the awsapi database connection doesn't
>>have its own settings; rather, it uses the same host connection
>>settings as the cloud db and the autoReconnect setting from the usage
>>database settings. Was this a shortcut, or is there a reason for it?
>>My current version of the patch just keeps the same methodology, but
>>it seems that while I'm adding properties to db.properties we could
>>allow true db.awsapi.host and db.awsapi.port.
>>
>>On Sun, Jul 7, 2013 at 1:02 AM, Marcus Sorensen <shadow...@gmail.com>
>>wrote:
>>> Oh, and I should correct myself, it doesn't crash, it seems that the
>>> management server fences itself because it can't talk to the database.
>>>
>>> On Sun, Jul 7, 2013 at 12:59 AM, Marcus Sorensen <shadow...@gmail.com>
>>> wrote:
>>>> Ok. After a cursory look, I've seen that the autoReconnect is kind of
>>>> a bad option for jdbc. I've also found this, which seems kind of hairy
>>>> for what I want to do:
>>>>
>>>> http://dev.mysql.com/doc/refman/5.0/en/connector-j-usagenotes-j2ee-concepts-managing-load-balanced-connections.html
>>>>
>>>> I don't necessarily want to hand off the loadbalancing management to
>>>> the java code, I just want cloudstack to automatically reinitialize
>>>> the database connection when this 'communications link failure'
>>>> occurs, maybe with a db.cloud.connection.retry.count property or
>>>> similar.
>>>>
>>>> On Sun, Jul 7, 2013 at 12:54 AM, Wido den Hollander <w...@widodh.nl>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> On 07/07/2013 08:45 AM, Marcus Sorensen wrote:
>>>>>>
>>>>>> I see that my db.properties has db.cloud.autoReconnect=true, which
>>>>>> translates to setting autoReconnect in the jdbc driver connection in
>>>>>> utils/src/com/cloud/utils/db/Transaction.java. I also see that if I
>>>>>> manually trigger the issue I get:
>>>>>>
>>>>>
>>>>> Just to confirm, I see the same issues. I haven't looked into this
>>>>> yet, but this is also one of the things I want to have fixed.
>>>>>
>>>>> Maybe create an issue for it?
>>>>>
>>>>> Wido
>>>>>
>>>>>> 2013-07-07 00:42:50,502 ERROR [cloud.cluster.ClusterManagerImpl]
>>>>>> (Cluster-Heartbeat-1:null) Runtime DB exception
>>>>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>>>>> Communications link failure
>>>>>>
>>>>>> The last packet successfully received from the server was 1,503
>>>>>> milliseconds ago. The last packet sent successfully to the server
>>>>>> was 0 milliseconds ago.
>>>>>>     at sun.reflect.GeneratedConstructorAccessor159.newInstance(Unknown Source)
>>>>>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>>>>>>     at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
>>>>>>     at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1117)
>>>>>>     at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3567)
>>>>>>     at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3456)
>>>>>>     at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3997)
>>>>>>     at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2468)
>>>>>>     at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2629)
>>>>>>     at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2719)
>>>>>>     at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
>>>>>>     at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2318)
>>>>>>     at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
>>>>>>     at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
>>>>>>     at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:409)
>>>>>>     at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>>     at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:350)
>>>>>>     at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>>     at com.cloud.utils.db.GenericDaoBase.listIncludingRemovedBy(GenericDaoBase.java:907)
>>>>>>     at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>>     at com.cloud.utils.db.GenericDaoBase.listIncludingRemovedBy(GenericDaoBase.java:912)
>>>>>>     at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>>     at com.cloud.cluster.dao.ManagementServerHostDaoImpl.getActiveList(ManagementServerHostDaoImpl.java:158)
>>>>>>     at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>>     at com.cloud.cluster.ClusterManagerImpl.peerScan(ClusterManagerImpl.java:1057)
>>>>>>     at com.cloud.cluster.ClusterManagerImpl.access$1200(ClusterManagerImpl.java:95)
>>>>>>     at com.cloud.cluster.ClusterManagerImpl$4.run(ClusterManagerImpl.java:789)
>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>>     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>>>>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:679)
>>>>>> Caused by: java.io.EOFException: Can not read response from server.
>>>>>> Expected to read 4 bytes, read 0 bytes before connection was
>>>>>> unexpectedly lost.
>>>>>>     ... 55 more
>>>>>> 2013-07-07 00:42:50,505 ERROR [cloud.cluster.ClusterManagerImpl]
>>>>>> (Cluster-Heartbeat-1:null) DB communication problem detected, fence
>>>>>> it
>>>>>>
>>>>>> And I have only to restart cloudstack-management so it can connect to
>>>>>> another member in the loadbalanced multimaster database to get things
>>>>>> running again.
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 7, 2013 at 12:35 AM, Marcus Sorensen
>>>>>> <shadow...@gmail.com> wrote:
>>>>>>>
>>>>>>> I've noticed that the cloudstack management server creates
>>>>>>> persistent connections to the database, and crashes if the
>>>>>>> database connection is lost. I haven't looked at the code yet,
>>>>>>> but I was wondering if anyone knew about what was going on here,
>>>>>>> if it's simply not set up to gracefully handle reconnect, or
>>>>>>> something else. We have a multi-master database setup, but
>>>>>>> cloudstack doesn't take advantage of it since it doesn't attempt
>>>>>>> graceful reconnect; if the particular node it connected to on
>>>>>>> startup goes down, it simply crashes.
>
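
For reference, the load-balancing support in the MySQL JDBC driver that this thread refers to is selected through the connection URL, using the `jdbc:mysql:loadbalance://` scheme with a comma-separated host list. The following is only a sketch of that URL shape; `LoadBalancedUrlSketch` and `build` are hypothetical names, the host names in the usage note are made up, and none of this is taken from the patch discussed above or from CloudStack's Transaction.java.

```java
import java.util.List;

/** Hypothetical helper, not CloudStack code: builds a Connector/J load-balanced URL. */
final class LoadBalancedUrlSketch {

    /**
     * @param hostPorts entries such as "db1:3306" (illustrative values only)
     * @param database  schema name, e.g. "cloud"
     */
    static String build(List<String> hostPorts, String database) {
        if (hostPorts == null || hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        // The driver chooses among the listed hosts; the application sees a single URL.
        return "jdbc:mysql:loadbalance://" + String.join(",", hostPorts) + "/" + database;
    }
}
```

For example, `build(Arrays.asList("db1:3306", "db2:3306"), "cloud")` yields a single URL listing both masters. Whether the rest of the management server copes with a failover behind that URL is the open question raised in the replies above.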