[ https://issues.apache.org/jira/browse/GEODE-657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15143863#comment-15143863 ]
ASF subversion and git services commented on GEODE-657:
-------------------------------------------------------
Commit 2b2e619e35a44c6e5f1f9881b5fba46d884a3934 in incubator-geode's branch
refs/heads/develop from [~barry.oglesby]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=2b2e619 ]
GEODE-657: Prevent two backups from happening at the same time
Adding a dlock to prevent two backups from occurring at the same time.
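As a rough illustration of the idea in this commit message (not the code from the commit itself), the sketch below refuses a second backup while one is already running. A local `Semaphore` stands in for the distributed lock, and the class and method names (`SingleBackupGuard`, `tryBackup`) are hypothetical. The actual change may wait for the lock rather than refuse.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical guard: at most one backup runs at a time. */
final class SingleBackupGuard {
    // Assumption: a single local permit stands in for a cluster-wide dlock.
    private final Semaphore backupPermit = new Semaphore(1);

    /**
     * Runs all backup phases if no other backup holds the permit.
     * Returns false without running anything if a backup is already active.
     */
    boolean tryBackup(Runnable backupPhases) {
        if (!backupPermit.tryAcquire()) {
            return false; // another backup is in progress
        }
        try {
            backupPhases.run(); // flush, prepare and finish would happen here
            return true;
        } finally {
            backupPermit.release(); // always released, even if a phase throws
        }
    }
}
```

Because the permit is held across all phases, a second backup cannot reach the prepare phase on any member while the first is still finishing.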
> Two backups occurring simultaneously can cause the BackupLock to be obtained
> but not released
> ---------------------------------------------------------------------------------------------
>
> Key: GEODE-657
> URL: https://issues.apache.org/jira/browse/GEODE-657
> Project: Geode
> Issue Type: Bug
> Components: persistence
> Reporter: Barry Oglesby
> Assignee: Dan Smith
>
> Backup is done in three phases, namely {{FlushToDiskRequest}},
> {{PrepareBackupRequest}} and {{FinishBackupRequest}}.
> Here is some debug logging from the locator showing the backup phases:
> {noformat}
> [warning 2015/12/01 14:37:38.783 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers STARTED
> [warning 2015/12/01 14:37:38.794 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send PrepareBackupRequest
> [warning 2015/12/01 14:37:38.984 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to PrepareBackupRequest
> [warning 2015/12/01 14:37:38.985 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers about to send FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers received all responses to FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl backupAllMembers COMPLETED
> {noformat}
> Here is some debug logging from one of the members. The others are similar.
> {noformat}
> [warning 2015/12/01 14:37:38.795 PST <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse received FlushToDiskRequest
> [warning 2015/12/01 14:37:38.974 PST <Pooled Message Processor 2> tid=0x66] FlushToDiskRequest createResponse sending FlushToDiskResponse
> [warning 2015/12/01 14:37:38.978 PST <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:37:38.978 PST <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to start backup
> [warning 2015/12/01 14:37:38.980 PST <Pooled Message Processor 2> tid=0x66] PrepareBackupResponse createResponse: About to prepare backup
> [warning 2015/12/01 14:37:38.980 PST <Pooled Message Processor 2> tid=0x66] BackupLock lockForBackup invoked
> [warning 2015/12/01 14:37:38.980 PST <Pooled Message Processor 2> tid=0x66] PrepareBackupRequest createResponse sending PrepareBackupResponse
> [warning 2015/12/01 14:37:38.986 PST <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse received FinishBackupRequest
> [warning 2015/12/01 14:37:39.970 PST <Pooled Message Processor 2> tid=0x66] BackupLock unlockForBackup invoked
> [warning 2015/12/01 14:37:49.885 PST <Pooled Message Processor 2> tid=0x66] FinishBackupRequest createResponse sending FinishBackupResponse
> {noformat}
> Processing the {{PrepareBackupRequest}} is where the {{BackupLock}} is taken
> (in {{BackupLock lockForBackup}}). Processing the {{FinishBackupRequest}} is
> where it is released (in {{BackupLock unlockForBackup}}).
> All members operate independently. If one member completes its backup before
> the others and a second backup is started, the issue can occur: that member
> is available to start a new backup, but the others are not.
> In this case, the member that finished early successfully processes the
> {{PrepareBackupRequest}} and takes the {{BackupLock}}.
> The other members fail to process the {{PrepareBackupRequest}} and do not
> take the {{BackupLock}}:
> {noformat}
> [warning 2015/12/01 14:46:02.447 PST <Pooled Message Processor 3> tid=0xd7] PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:46:02.448 PST <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: About to start backup
> [error 2015/12/01 14:46:02.448 PST <Pooled Message Processor 3> tid=0xd7] Error processing request class com.gemstone.gemfire.admin.internal.PrepareBackupRequest.
> java.io.IOException: Backup already in progress
> at com.gemstone.gemfire.internal.cache.GemFireCacheImpl.startBackup(GemFireCacheImpl.java:4315)
> at com.gemstone.gemfire.admin.internal.PrepareBackupRequest.createResponse(PrepareBackupRequest.java:87)
> at com.gemstone.gemfire.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:28)
> at com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
> at com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
> at com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963)
> at java.lang.Thread.run(Thread.java:745)
> [warning 2015/12/01 14:46:02.448 PST <Pooled Message Processor 3> tid=0xd7] PrepareBackupResponse createResponse: Returning exception
> java.io.IOException: Backup already in progress
> {noformat}
> Once this happens, the members are out of sync. One has taken the
> {{BackupLock}}; the others have not. The member that has taken the
> {{BackupLock}} will never release it.
> One solution would be to prevent any backup from starting if one is in
> progress (in the locator). Another would be to force the member to release
> the {{BackupLock}} if other members fail to get it. I think this would
> require a new message, though.
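The second option in the quoted description (releasing the lock on members that prepared successfully when others fail) could look roughly like the sketch below. It is a toy model only: `MemberSketch`, `PrepareAllSketch`, `prepareAll` and the `abort` step are hypothetical names, `abort` stands in for the new message mentioned in the description, and none of this is Geode's actual code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one member's backup state. */
final class MemberSketch {
    private boolean backupInProgress; // true from prepare until the backup completes
    private boolean backupLockHeld;   // stands in for the BackupLock

    /** Creates a member that is still finishing an earlier backup. */
    static MemberSketch stillBackingUp() {
        MemberSketch member = new MemberSketch();
        member.backupInProgress = true;
        return member;
    }

    /** Models PrepareBackupRequest: start the backup, then take the lock. */
    void prepare() throws IOException {
        if (backupInProgress) {
            throw new IOException("Backup already in progress");
        }
        backupInProgress = true;
        backupLockHeld = true;
    }

    /** Hypothetical abort message: undo a successful prepare. */
    void abort() {
        backupLockHeld = false;
        backupInProgress = false;
    }

    boolean holdsBackupLock() {
        return backupLockHeld;
    }
}

final class PrepareAllSketch {
    /** Prepares every member; if any fails, aborts the ones that succeeded. */
    static boolean prepareAll(List<MemberSketch> members) {
        List<MemberSketch> prepared = new ArrayList<>();
        boolean anyFailed = false;
        for (MemberSketch member : members) {
            try {
                member.prepare();
                prepared.add(member);
            } catch (IOException e) {
                anyFailed = true; // this member is still busy with an earlier backup
            }
        }
        if (anyFailed) {
            for (MemberSketch member : prepared) {
                member.abort(); // release the lock so it is not held forever
            }
        }
        return !anyFailed;
    }
}
```

With one idle member and one created via `stillBackingUp()`, `prepareAll` returns false and the idle member no longer holds the lock afterwards, which is the state the description says is missing today.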