Re: database connection resilience

Marcus Sorensen Mon, 08 Jul 2013 13:41:37 -0700

So to the original question, is it your opinion that a single
management server (non-clustered) should also fence itself, or wait
for the database connection to be restored?
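
For reference, the "wait for the database connection to be restored" option could look roughly like the sketch below. This is not CloudStack code: the class DbRetry, the method withRetry, and its parameters are hypothetical names, with retryCount standing in for the db.cloud.connection.retry.count property proposed further down the thread. It assumes the connection failure surfaces as a java.sql.SQLException, which may not hold where the DAO layer wraps it in a runtime exception.

```java
import java.sql.SQLException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of CloudStack: retries a database operation
// a bounded number of times before giving up.
final class DbRetry {

    // Runs op once, then up to retryCount more times if it throws SQLException,
    // sleeping delayMillis between attempts. Rethrows the last SQLException
    // when every attempt has failed.
    static <T> T withRetry(Callable<T> op, int retryCount, long delayMillis) throws Exception {
        if (retryCount < 0) {
            throw new IllegalArgumentException("retryCount must be >= 0");
        }
        SQLException last = null;
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            try {
                return op.call();
            } catch (SQLException e) {
                last = e;
                if (attempt < retryCount) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated operation: fails twice with a link failure, then succeeds.
        int[] calls = {0};
        String result = withRetry(() -> {
            if (calls[0]++ < 2) {
                throw new SQLException("Communications link failure");
            }
            return "heartbeat ok";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A single management server could run its heartbeat query through such a wrapper and only fence once the retries are exhausted. Whether in-flight orchestration work survives the outage is a separate question that this sketch does not address.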


On Mon, Jul 8, 2013 at 2:03 PM, Kelven Yang <kelven.y...@citrix.com> wrote:
>
>
> On 7/7/13 3:36 PM, "Marcus Sorensen" <shadow...@gmail.com> wrote:
>
>>I think there are two separate issues here.
>>
>>1) The management server uses the database to determine cluster
>>membership, and if no database connection can be made, the management
>>server fences itself (shuts down). This is good, but in the case where
>>there's only one management server (no cluster intended), it seems
>>like an issue. However, it may be better to shut down; I'm not sure
>>how the management server will react after a temporary database
>>outage. Some opinions would be appreciated. My preference would be
>>that a single management server would just be able to pick back up
>>where it left off rather than dying.
>
> In a management server cluster setup with multiple management servers, to
> avoid a split-brain situation we actively self-fence a management server
> as soon as individual management servers detect an inconsistent view of
> the cluster.
>
> As the clustering logic relies heavily on the DB, loss of DB connectivity is
> considered a fatal event that triggers self-fence, in addition to
> inconsistent view detection. A multi-master DB setup only works if
> the switch of database instance is transparent to CloudStack. That means
> automatic database fail-over should be handled completely at the DB
> connectivity layer, and CloudStack should not be aware of it. Most of the
> current CloudStack logic is built upon this assumption. It may be possible
> to relax this requirement, but we need to investigate the impact and test
> how resilient CloudStack would be to unexpected DB connectivity
> exceptions in the middle of various orchestration workflows.
>
>>
>>2) There is no support for JDBC's built-in loadbalancing features. I
>>have a patch that fixes this, however I noticed a few things that I'd
>>like some feedback on. Namely, the awsapi database connection doesn't
>>have its own settings, rather it uses the same host connection
>>settings as the cloud db and the autoReconnect setting from the usage
>>database settings. Was this a shortcut, or is there a reason for it?
>>My current version of the patch just keeps the same methodology, but
>>it seems that while I'm adding properties to db.properties we could
>>allow separate db.awsapi.host and db.awsapi.port settings.
>>
>>On Sun, Jul 7, 2013 at 1:02 AM, Marcus Sorensen <shadow...@gmail.com> wrote:
>>> Oh, and I should correct myself, it doesn't crash, it seems that the
>>> management server fences itself because it can't talk to the database.
>>>
>>> On Sun, Jul 7, 2013 at 12:59 AM, Marcus Sorensen <shadow...@gmail.com> wrote:
>>>> Ok. After a cursory look, I've seen that the autoReconnect is kind of
>>>> a bad option for jdbc. I've also found this, which seems kind of hairy
>>>> for what I want to do:
>>>>
>>>>
>>>> http://dev.mysql.com/doc/refman/5.0/en/connector-j-usagenotes-j2ee-concepts-managing-load-balanced-connections.html
>>>>
>>>> I don't necessarily want to hand off the loadbalancing management to
>>>> the java code, I just want cloudstack to automatically reinitialize
>>>> the database connection when this 'communications link failure'
>>>> occurs, maybe with a db.cloud.connection.retry.count property or
>>>> similar.
>>>>
>>>> On Sun, Jul 7, 2013 at 12:54 AM, Wido den Hollander <w...@widodh.nl> wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> On 07/07/2013 08:45 AM, Marcus Sorensen wrote:
>>>>>>
>>>>>> I see that my db.properties has db.cloud.autoReconnect=true, which
>>>>>> translates to setting autoReconnect in the jdbc driver connection in
>>>>>> utils/src/com/cloud/utils/db/Transaction.java. I also see that if I
>>>>>> manually trigger the issue I get:
>>>>>>
>>>>>
>>>>> Just to confirm, I see the same issues. I haven't looked into this yet, but
>>>>> this is also one of the things I want to have fixed.
>>>>>
>>>>> Maybe create an issue for it?
>>>>>
>>>>> Wido
>>>>>
>>>>>
>>>>>> 2013-07-07 00:42:50,502 ERROR [cloud.cluster.ClusterManagerImpl]
>>>>>> (Cluster-Heartbeat-1:null) Runtime DB exception
>>>>>> com.mysql.jdbc.exceptions.jdbc4.CommunicationsException:
>>>>>> Communications link failure
>>>>>>
>>>>>> The last packet successfully received from the server was 1,503
>>>>>> milliseconds ago.  The last packet sent successfully to the server was
>>>>>> 0 milliseconds ago.
>>>>>> at sun.reflect.GeneratedConstructorAccessor159.newInstance(Unknown Source)
>>>>>> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>> at java.lang.reflect.Constructor.newInstance(Constructor.java:532)
>>>>>> at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
>>>>>> at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1117)
>>>>>> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3567)
>>>>>> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3456)
>>>>>> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3997)
>>>>>> at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2468)
>>>>>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2629)
>>>>>> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2719)
>>>>>> at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
>>>>>> at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2318)
>>>>>> at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
>>>>>> at org.apache.commons.dbcp.DelegatingPreparedStatement.executeQuery(DelegatingPreparedStatement.java:96)
>>>>>> at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:409)
>>>>>> at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>> at com.cloud.utils.db.GenericDaoBase.searchIncludingRemoved(GenericDaoBase.java:350)
>>>>>> at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>> at com.cloud.utils.db.GenericDaoBase.listIncludingRemovedBy(GenericDaoBase.java:907)
>>>>>> at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>> at com.cloud.utils.db.GenericDaoBase.listIncludingRemovedBy(GenericDaoBase.java:912)
>>>>>> at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>> at com.cloud.cluster.dao.ManagementServerHostDaoImpl.getActiveList(ManagementServerHostDaoImpl.java:158)
>>>>>> at com.cloud.utils.component.ComponentInstantiationPostProcessor$InterceptorDispatcher.intercept(ComponentInstantiationPostProcessor.java:125)
>>>>>> at com.cloud.cluster.ClusterManagerImpl.peerScan(ClusterManagerImpl.java:1057)
>>>>>> at com.cloud.cluster.ClusterManagerImpl.access$1200(ClusterManagerImpl.java:95)
>>>>>> at com.cloud.cluster.ClusterManagerImpl$4.run(ClusterManagerImpl.java:789)
>>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>>> at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>>>>>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>>>>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>>>>>> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>> at java.lang.Thread.run(Thread.java:679)
>>>>>> Caused by: java.io.EOFException: Can not read response from server.
>>>>>> Expected to read 4 bytes, read 0 bytes before connection was
>>>>>> unexpectedly lost.
>>>>>> ... 55 more
>>>>>> 2013-07-07 00:42:50,505 ERROR [cloud.cluster.ClusterManagerImpl]
>>>>>> (Cluster-Heartbeat-1:null) DB communication problem detected, fence it
>>>>>>
>>>>>> And I have only to restart cloudstack-management so it can connect to
>>>>>> another member in the loadbalanced multimaster database to get things
>>>>>> running again.
>>>>>>
>>>>>>
>>>>>> On Sun, Jul 7, 2013 at 12:35 AM, Marcus Sorensen <shadow...@gmail.com> wrote:
>>>>>>>
>>>>>>> I've noticed that the cloudstack management server creates persistent
>>>>>>> connections to the database, and crashes if the database connection is
>>>>>>> lost. I haven't looked at the code yet, but I was wondering if anyone
>>>>>>> knew about what was going on here, if it's simply not set up to
>>>>>>> gracefully handle reconnect, or something else.  We have a
>>>>>>> multi-master database setup, but cloudstack doesn't take advantage of
>>>>>>> it since it doesn't attempt graceful reconnect, if the particular node
>>>>>>> it connected to on startup goes down, it simply crashes.
>
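
For point 2 above, MySQL Connector/J selects its load-balancing mode through the JDBC URL scheme (jdbc:mysql:loadbalance:// followed by a comma-separated host list). A minimal sketch of assembling such a URL from configuration values follows. The class LoadBalancedUrl and its parameters are hypothetical, not part of the patch discussed here, and how the values would map to db.properties keys is left open.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper, not CloudStack code: assembles a MySQL Connector/J URL.
final class LoadBalancedUrl {

    // hostPorts holds entries such as "db1:3306". With more than one entry the
    // loadbalance scheme is used; a single entry yields a plain jdbc:mysql URL.
    // driverProps are appended as query parameters in sorted order.
    static String build(List<String> hostPorts, String database, Properties driverProps) {
        if (hostPorts == null || hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one host is required");
        }
        String scheme = hostPorts.size() > 1 ? "jdbc:mysql:loadbalance://" : "jdbc:mysql://";
        StringBuilder url = new StringBuilder(scheme)
                .append(String.join(",", hostPorts))
                .append('/')
                .append(database);
        char separator = '?';
        for (String name : new TreeSet<>(driverProps.stringPropertyNames())) {
            url.append(separator).append(name).append('=').append(driverProps.getProperty(name));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("connectTimeout", "5000");
        // Host names and database name here are placeholders for illustration.
        System.out.println(build(Arrays.asList("db1:3306", "db2:3306"), "cloud", props));
    }
}
```

Keeping the single-host case on the plain scheme means an installation with one database host would see no change in behaviour. Giving the awsapi database its own host and port would then just mean calling the same builder with a different host list.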
