[ 
https://issues.apache.org/jira/browse/GEODE-657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Smith reassigned GEODE-657:
-------------------------------

    Assignee: Dan Smith

> Two backups occurring simultaneously can cause the BackupLock to be obtained 
> but not released
> ---------------------------------------------------------------------------------------------
>
>                 Key: GEODE-657
>                 URL: https://issues.apache.org/jira/browse/GEODE-657
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Barry Oglesby
>            Assignee: Dan Smith
>
> Backup is done in three phases, namely {{FlushToDiskRequest}}, 
> {{PrepareBackupRequest}} and {{FinishBackupRequest}}.
> Here is some debugging in the locator showing the backup phases:
> {noformat}
> [warning 2015/12/01 14:37:38.783 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers STARTED
> [warning 2015/12/01 14:37:38.794 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers about to send FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers received all responses to FlushToDiskRequest
> [warning 2015/12/01 14:37:38.977 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers about to send PrepareBackupRequest
> [warning 2015/12/01 14:37:38.984 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers received all responses to PrepareBackupRequest
> [warning 2015/12/01 14:37:38.985 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers about to send FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers received all responses to FinishBackupRequest
> [warning 2015/12/01 14:37:51.184 PST locator <RMI TCP 
> Connection(3)-192.168.2.11> tid=0x99] AdminDistributedSystemImpl 
> backupAllMembers COMPLETED
> {noformat}
> Here is some debugging in one of the members. The others are similar.
> {noformat}
> [warning 2015/12/01 14:37:38.795 PST  <Pooled Message Processor 2> tid=0x66] 
> FlushToDiskRequest createResponse received FlushToDiskRequest
> [warning 2015/12/01 14:37:38.974 PST  <Pooled Message Processor 2> tid=0x66] 
> FlushToDiskRequest createResponse sending FlushToDiskResponse
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] 
> PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:37:38.978 PST  <Pooled Message Processor 2> tid=0x66] 
> PrepareBackupResponse createResponse: About to start backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] 
> PrepareBackupResponse createResponse: About to prepare backup
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] 
> BackupLock lockForBackup invoked
> [warning 2015/12/01 14:37:38.980 PST  <Pooled Message Processor 2> tid=0x66] 
> PrepareBackupRequest createResponse sending PrepareBackupResponse
> [warning 2015/12/01 14:37:38.986 PST  <Pooled Message Processor 2> tid=0x66] 
> FinishBackupRequest createResponse received FinishBackupRequest
> [warning 2015/12/01 14:37:39.970 PST  <Pooled Message Processor 2> tid=0x66] 
> BackupLock unlockForBackup invoked
> [warning 2015/12/01 14:37:49.885 PST  <Pooled Message Processor 2> tid=0x66] 
> FinishBackupRequest createResponse sending FinishBackupResponse
> {noformat}
> Processing the {{PrepareBackupRequest}} is where the {{BackupLock}} is taken 
> (in {{BackupLock lockForBackup}}). Processing the {{FinishBackupRequest}} is 
> where it is released (in {{BackupLock unlockForBackup}}).
> All members operate independently. If one completes its backup before the 
> others, the issue can happen. That member is available to start a new backup, 
> but the others aren't.
> In this case, the member successfully processes the {{PrepareBackupRequest}} 
> and takes the {{BackupLock}}.
> The other members do not successfully process the {{PrepareBackupRequest}} 
> and do not take the {{BackupLock}}:
> {noformat}
> [warning 2015/12/01 14:46:02.447 PST  <Pooled Message Processor 3> tid=0xd7] 
> PrepareBackupRequest createResponse received PrepareBackupRequest
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] 
> PrepareBackupResponse createResponse: About to start backup
> [error 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] 
> Error processing request class 
> com.gemstone.gemfire.admin.internal.PrepareBackupRequest.
> java.io.IOException: Backup already in progress
>       at 
> com.gemstone.gemfire.internal.cache.GemFireCacheImpl.startBackup(GemFireCacheImpl.java:4315)
>       at 
> com.gemstone.gemfire.admin.internal.PrepareBackupRequest.createResponse(PrepareBackupRequest.java:87)
>       at 
> com.gemstone.gemfire.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:28)
>       at 
> com.gemstone.gemfire.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:386)
>       at 
> com.gemstone.gemfire.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:457)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at 
> com.gemstone.gemfire.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:692)
>       at 
> com.gemstone.gemfire.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:963)
>       at java.lang.Thread.run(Thread.java:745)
> [warning 2015/12/01 14:46:02.448 PST  <Pooled Message Processor 3> tid=0xd7] 
> PrepareBackupResponse createResponse: Returning exception 
> java.io.IOException: Backup already in progress
> {noformat}
> Once this happens, the members are out of sync. One has taken the 
> {{BackupLock}}; the others have not. The member that has taken the 
> {{BackupLock}} will never release it.
> One solution would be to prevent any backup from starting if one is in 
> progress (in the locator). Another would be to force the member to release 
> the {{BackupLock}} if other members fail to get it. I think this would 
> require a new message, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to