I had another "incorrect" failover this morning.  The secondary server was 
responding and processing user requests fine (as far as customers and 
monitoring reported) but there is a nearly 2 minute gap in the sql log-where of 
course no updates are made to the servgrp_board table.  According to what you 
say here, that would cause the failover alright...but what would cause the 
"hang-up" like this?  There are actually no logs at all for this 2 minute time 
period....That is obviously a different issue.

BUT the other intervals it appears to be finding delinquency, I don't see a gap 
like that.  All of the updates take place (see example below).  Does it matter 
what order they happen in?  What's the logic of figuring out delinquency?

>From SQL log on ncc392:
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9944 */SELECT checkInterval FROM servgrp_config FOR UPDATE
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9956 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9957 */UPDATE servgrp_board SET intervalCount = intervalCount
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9964 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9964 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9976 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:00.9981 */COMMIT WORK

000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0016 */SELECT checkInterval FROM servgrp_config FOR UPDATE
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0248 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0249 */UPDATE servgrp_board SET intervalCount = intervalCount
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0255 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0256 */SELECT serverName,serverPort,intervalCount,pendingSigna
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0267 */OK
000000   > <USER:                                              > /* Wed Jun 22 
2011 09:59:31.0272 */COMMIT WORK

But in the server group log on ncc391:
<SGRP> /* Wed Jun 22 2011 09:59:31.0224 */ Server ncc392.its.state.nc.us 
determined to be delinquent with 4 operation(s) owned
<SGRP> /* Wed Jun 22 2011 10:00:01.0659 */ Server ncc392.its.state.nc.us 
determined to be operational

In the sql log on ncc391:
Wed Jun 22 2011 09:59:00.9937 */SELECT checkInterval FROM servgrp_config FOR 
UPDATE
Wed Jun 22 2011 09:59:01.0138 */OK
Wed Jun 22 2011 09:59:01.0139 */UPDATE servgrp_board SET intervalCount = 
intervalCount + 1 WHERE serverName = <servername>
Wed Jun 22 2011 09:59:01.0145 */OK
Wed Jun 22 2011 09:59:01.0145 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1
Wed Jun 22 2011 09:59:01.0150 */OK
Wed Jun 22 2011 09:59:01.0160 */COMMIT WORK

Wed Jun 22 2011 09:59:31.0188 */SELECT checkInterval FROM servgrp_config FOR 
UPDATE
Wed Jun 22 2011 09:59:31.0194 */OK
Wed Jun 22 2011 09:59:31.0200 */OK
Wed Jun 22 2011 09:59:31.0200 */UPDATE servgrp_board SET intervalCount = 
intervalCount + 1 WHERE serverName = <servername>
Wed Jun 22 2011 09:59:31.0202 */SELECT C18058 FROM T68 WHERE C1 = 
'000000000000015'
Wed Jun 22 2011 09:59:31.0207 */OK
Wed Jun 22 2011 09:59:31.0207 */SELECT 
serverName,serverPort,intervalCount,pendingSignals,opFlags FROM servgrp_board 
ORDER BY 1
Wed Jun 22 2011 09:59:31.0209 */OK
Wed Jun 22 2011 09:59:31.0212 */UPDATE T69 SET 
C18099=1,C18103=NULL,C5='AREMAILUSER_PQHg0JAAPwAA',C6=1308751171 WHERE C1 = 
'000000010478546'
Wed Jun 22 2011 09:59:31.0215 */OK
Wed Jun 22 2011 09:59:31.0220 */OK
Wed Jun 22 2011 09:59:31.0220 */UPDATE T69 SET C18087 = NULL WHERE C1 IN 
('000000010478546')
Wed Jun 22 2011 09:59:31.0226 */COMMIT WORK
Wed Jun 22 2011 09:59:31.0226 */OK
Wed Jun 22 2011 09:59:31.0227 */COMMIT WORK


Anne Ramey
***********************************
E-mail correspondence to and from this address may be subject to the North 
Carolina Public Records Law and may be disclosed to third parties only by an 
authorized State Official.

From: Action Request System discussion list(ARSList) 
[mailto:arslist@ARSLIST.ORG] On Behalf Of Walters, Mark
Sent: Monday, May 23, 2011 10:51 AM
To: arslist@ARSLIST.ORG
Subject: Re: server group delinquency, inaccurate

**
The signalling options should not be relevant for delinquency checks - this is 
done via updates to the database.  In the SERVGRP_BOARD table there is a record 
for each server in your group.  Every check interval a server will increment 
its own INTERVALCOUNT value and note the values of the other servers.  At the 
next check interval the stored values are compared to those in the database 
and, if any of them have not changed, that server is considered delinquent.  If 
this happens X successive times, where X is the delinquency threshold, then the 
server is considered to have failed and a failover of the services to the next 
highest ranked server takes place.

Although there's a dedicated server group thread that handles this process it's 
not uncommon to see an occasional log entry reporting a delinquent server.  
This happens due to slight timing variations of when the intervalcount update 
process is performed on each server.   Each server has an exclusive lock on the 
SERVGRP tables during the update process and this can lead to one server 
running 2 check interval updates before another server has been able to perform 
its update.

However, as you had a failover, something must have prevented the intervalcount 
update on the server called secondary.  You could try increasing the 
delinquency threshold to 3 to give each server more time for the update to run 
or enabling SQL logging and check the updates to the SERVGRP_BOARD table.  You 
should see something like the following on each server every checkinterval;

SELECT checkInterval FROM servgrp_config FOR UPDATE
OK
UPDATE servgrp_board SET intervalCount = intervalCount + 1 WHERE serverName = 
'myservername'
OK
SELECT serverName,serverPort,intervalCount,pendingSignals,opFlags FROM 
servgrp_board ORDER BY 1
OK
COMMIT WORK

The SELECT  FOR UPDATE is used to serialise access to the tables and provide 
exclusive access for the server during the update.  If the clocks on the 
servers are sync'd you should be able to explain most of the delinquency 
reports in the server group log and perhaps understand why one was missed.

Mark

I work for BMC, I don't speak for them.



From: Action Request System discussion list(ARSList) 
[mailto:arslist@ARSLIST.ORG] On Behalf Of Ramey, Anne
Sent: 21 May 2011 02:14
To: arslist@ARSLIST.ORG
Subject: server group delinquency, inaccurate

**
Since we upgraded from ARS 7.1 to ARS 7.5 (patch 7), we've seen a lot of 
"incorrect" delinquency messages in the server group logs-each server detecting 
the other as delinquent quite frequently, even when they are responding just 
fine.  It happens at all hours, even after backups are finished and no one is 
in the office for the morning yet.  The systems are NOT busy, but they are 
being "delinquent". We are using the same settings of a 30 second interval with 
2 intervals before failover that we used before the upgrade. 
Server-Group-Signal-Option is already false.  We don't want to go all the way 
to Disable-ARSignals.  It got bad enough today to happen for two intervals and 
cause a failover during the work day.  This, of course, caused the system to 
hang for a moment, and it managed to kill the email engine-which didn't fail 
back over properly.  Has anyone else with ARS 7.5 p 7 seen something like this?

Primary:
<SGRP> /* Fri May 20 2011 11:57:37.5506 */ Server <secondary> determined to be 
operational
<SGRP> /* Fri May 20 2011 12:00:07.6808 */ Server <secondary> determined to be 
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 12:00:37.7094 */ Server <secondary> determined to be 
operational
<SGRP> /* Fri May 20 2011 16:49:52.4853 */ Server <secondary> determined to be 
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 16:50:22.4912 */ Server <secondary> delinquent for 2 
intervals with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 16:50:22.4913 */ Transferring pending full text 
indexing tasks from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.4939 */ Transferred 0 pending full text 
indexing tasks from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.4939 */ Failing over E-Mail Engine operation 
from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.4956 */ Resuming E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:50:22.4993 */ Failing over Flashboards operation 
from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.5011 */ Resuming Flashboards operation
<SGRP> /* Fri May 20 2011 16:50:22.5038 */ Failing over Assignment Engine 
operation from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.5055 */ Resuming Assignment Engine operation
<SGRP> /* Fri May 20 2011 16:50:22.5096 */ Failing over Reconciliation Engine 
operation from server <secondary> to server <primary>
<SGRP> /* Fri May 20 2011 16:50:22.5113 */ Resuming Reconciliation Engine 
operation
<SGRP> /* Fri May 20 2011 16:54:22.6434 */ Server <secondary> determined to be 
operational
<SGRP> /* Fri May 20 2011 16:54:22.6435 */ Server <primary> relinquishing 
E-Mail Engine operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6435 */ Suspending E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:22.6460 */ Server <primary> relinquishing 
Flashboards operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6461 */ Suspending Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:22.6544 */ Server <primary> relinquishing 
Assignment Engine operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6544 */ Suspending Assignment Engine 
operation
<SGRP> /* Fri May 20 2011 16:54:22.6615 */ Server <primary> relinquishing 
Reconciliation Engine operation to server <secondary>
<SGRP> /* Fri May 20 2011 16:54:22.6616 */ Suspending Reconciliation Engine 
operation
<SGRP> /* Fri May 20 2011 18:44:11.7233 */ Server <secondary> determined to be 
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 18:44:41.7522 */ Server <secondary> determined to be 
operational
<SGRP> /* Fri May 20 2011 18:56:42.0804 */ Server <secondary> determined to be 
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 18:57:12.0894 */ Server <secondary> determined to be 
operational
<SGRP> /* Fri May 20 2011 18:58:42.1558 */ Server <secondary> determined to be 
delinquent with 4 operation(s) owned
<SGRP> /* Fri May 20 2011 18:59:12.2134 */ Server <secondary> determined to be 
operational


Secondary:
<SGRP> /* Fri May 20 2011 11:56:37.4981 */ Server <primary> determined to be 
operational
<SGRP> /* Fri May 20 2011 11:59:37.6687 */ Server <primary> determined to be 
delinquent with 8 operation(s) owned
<SGRP> /* Fri May 20 2011 12:00:07.9562 */ Server <primary> determined to be 
operational
<SGRP> /* Fri May 20 2011 16:54:02.0895 */ Ownership of the E-Mail Engine 
operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.0896 */ Suspending E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:02.0932 */ Ownership of the Flashboards 
operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.0933 */ Suspending Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:02.0963 */ Ownership of the Assignment Engine 
operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.0963 */ Suspending Assignment Engine 
operation
<SGRP> /* Fri May 20 2011 16:54:02.1026 */ Ownership of the Reconciliation 
Engine operation removed unexpectedly from server <secondary>
<SGRP> /* Fri May 20 2011 16:54:02.1027 */ Suspending Reconciliation Engine 
operation
<SGRP> /* Fri May 20 2011 16:54:32.1604 */ Server <secondary> assuming 
responsibility for unowned E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1605 */ Resuming E-Mail Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1634 */ Server <secondary> assuming 
responsibility for unowned Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:32.1635 */ Resuming Flashboards operation
<SGRP> /* Fri May 20 2011 16:54:32.1683 */ Server <secondary> assuming 
responsibility for unowned Assignment Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1684 */ Resuming Assignment Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1719 */ Server <secondary> assuming 
responsibility for unowned Reconciliation Engine operation
<SGRP> /* Fri May 20 2011 16:54:32.1719 */ Resuming Reconciliation Engine 
operation
<SGRP> /* Fri May 20 2011 18:51:41.9330 */ Server <primary> determined to be 
delinquent with 8 operation(s) owned
<SGRP> /* Fri May 20 2011 18:52:11.9556 */ Server <primary> determined to be 
operational
<SGRP> /* Fri May 20 2011 18:57:42.1122 */ Server <primary> determined to be 
delinquent with 8 operation(s) owned
<SGRP> /* Fri May 20 2011 18:58:12.1257 */ Server <primary> determined to be 
operational


Anne Ramey
***********************************
E-mail correspondence to and from this address may be subject to the North 
Carolina Public Records Law and may be disclosed to third parties only by an 
authorized State Official.

_attend WWRUG11 www.wwrug.com<http://www.wwrug.com> ARSlist: "Where the Answers 
Are"_
_attend WWRUG11 www.wwrug.com ARSlist: "Where the Answers Are"_

_______________________________________________________________________________
UNSUBSCRIBE or access ARSlist Archives at www.arslist.org
attend wwrug11 www.wwrug.com ARSList: "Where the Answers Are"

Reply via email to