[ https://issues.apache.org/jira/browse/GEODE-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dan Smith reassigned GEODE-657: ------------------------------- Assignee: Dan Smith > Two backups occurring simultaneously can cause the BackupLock to be obtained > but not released > --------------------------------------------------------------------------------------------- > > Key: GEODE-657 > URL: https://issues.apache.org/jira/browse/GEODE-657 > Project: Geode > Issue Type: Bug > Components: persistence > Reporter: Barry Oglesby > Assignee: Dan Smith > > Backup is done in three phases, namely {{FlushToDiskRequest}}, > {{PrepareBackupRequest}} and {{FinishBackupRequest}}. > Here is some debugging in the locator showing the backup phases: > {noformat} > [warning 2015/12/01 14:37:38.783 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers STARTED > [warning 2015/12/01 14:37:38.794 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers about to send FlushToDiskRequest > [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers received all responses to FlushToDiskRequest > [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers about to send PrepareBackupRequest > [warning 2015/12/01 14:37:38.984 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers received all responses to PrepareBackupRequest > [warning 2015/12/01 14:37:38.985 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers about to send FinishBackupRequest > [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers received all responses to FinishBackupRequest > [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP > Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl > backupAllMembers COMPLETED > {noformat} > Here is some debugging in one of the members. The others are similar. > {noformat} > [warning 2015/12/01 14:37:38.795 PST <Pooled Message Processor 2> tid=0x66] > FlushToDiskRequest createResponse received FlushToDiskRequest > [warning 2015/12/01 14:37:38.974 PST <Pooled Message Processor 2> tid=0x66] > FlushToDiskRequest createResponse sending FlushToDiskResponse > [warning 2015/12/01 14:37:38.978 PST <Pooled Message Processor 2> tid=0x66] > PrepareBackupRequest createResponse received PrepareBackupRequest > [warning 2015/12/01 14:37:38.978 PST <Pooled Message Processor 2> tid=0x66] > PrepareBackupResponse createResponse: About to start backup > [warning 2015/12/01 14:37:38.980 PST <Pooled Message Processor 2> tid=0x66] > PrepareBackupResponse createResponse: About to prepare backup > [warning 2015/12/01 14:37:38.980 PST <Pooled Message Processor 2> tid=0x66] > BackupLock lockForBackup invoked > [warning 2015/12/01 14:37:38.980 PST <Pooled Message Processor 2> tid=0x66] > PrepareBackupRequest createResponse sending PrepareBackupResponse > [warning 2015/12/01 14:37:38.986 PST <Pooled Message Processor 2> tid=0x66] > FinishBackupRequest createResponse received FinishBackupRequest > [warning 2015/12/01 14:37:39.970 PST <Pooled Message Processor 2> tid=0x66] > BackupLock unlockForBackup invoked > [warning 2015/12/01 14:37:49.885 PST <Pooled Message Processor 2> tid=0x66] > FinishBackupRequest createResponse sending FinishBackupResponse > {noformat} > Processing the {{PrepareBackupRequest}} is where the {{BackupLock}} is taken > (in {{BackupLock lockForBackup}}). Processing the {{FinishBackupRequest}} is > where it is released (in {{BackupLock unlockForBackup}}). > All members operate independently. If one completes its backup before the > others, the issue can happen. That member is available to start a new backup, > but the others aren't. > In this case, the member successfully processes the {{PrepareBackupRequest}} > and takes the {{BackupLock}}. > The other members do not successfully process the {{PrepareBackupRequest}} > and do not take the {{BackupLock}}: > {noformat} > [warning 2015/12/01 14:46:02.447 PST <Pooled Message Processor 3> tid=0xd7] > PrepareBackupRequest createResponse received PrepareBackupRequest > [warning 2015/12/01 14:46:02.448 PST <Pooled Message Processor 3> tid=0xd7] > PrepareBackupResponse createResponse: About to start backup > [error 2015/12/01 14:46:02.448 PST <Pooled Message Processor 3> tid=0xd7] > Error processing request class > com.gemstone.gemfire.admin.internal.PrepareBackupRequest. > java.io.IOException: Backup already in progress > at > com.gemstone.gemfire.internal.cache.GemFireCacheImpl.startBackup(GemFireCacheImpl.java:4315) > at > com.gemstone.gemfire.admin.internal.PrepareBackupRequest.createResponse(PrepareBackupRequest.java:87) > at > com.gemstone.gemfire.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:28) > at > com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386) > at > com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at > com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692) > at > com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963) > at java.lang.Thread.run(Thread.java:745) > [warning 2015/12/01 14:46:02.448 PST <Pooled Message Processor 3> tid=0xd7] > PrepareBackupResponse createResponse: Returning exception > java.io.IOException: Backup already in progress > {noformat} > Once this happens, the members are out of sync. One has taken the > {{BackupLock}}; the others have not. The member that has taken the > {{BackupLock}} will never release it. > One solution would be to prevent any backup from starting if one is in > progress (in the locator). Another would be to force the member to release > the {{BackupLock}} if other members fail to get it. I think this would > require a new message, though. -- This message was sent by Atlassian JIRA (v6.3.4#6332)