Hi Uma, We have had a related issue in BOOKKEEPER-112 and there is a doc there 
describing how we deal with it. It might help to give it a look.

-Flavio

On Jun 27, 2012, at 7:06 AM, Uma Maheswara Rao G wrote:

> Right. But the current replication process also considers OPEN ledgers, so 
> the LedgerChecker cannot know whether an ensemble was just reformed by the 
> client or is still in progress for a write.
> 
> One way is to skip replication for in-progress ledgers. But the Auditor may 
> then need to periodically recheck whichever open ledgers it came across?
> 
> IMO, replicating in-progress ledgers may create some inconsistencies.
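
A minimal sketch of the "skip in-progress ledgers" idea above, assuming the auditor can tell whether a ledger is closed. The class OpenLedgerFilter, its LedgerInfo type, and both lists are hypothetical names made up for illustration; none of them are BookKeeper APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not BookKeeper code: split ledgers into those that
// are safe to re-replicate now and those to look at again on a later pass.
final class OpenLedgerFilter {
    /** Stand-in for ledger metadata: an id plus a closed/open flag. */
    static final class LedgerInfo {
        final long id;
        final boolean closed;

        LedgerInfo(long id, boolean closed) {
            this.id = id;
            this.closed = closed;
        }
    }

    /** Closed ledgers: their ensembles can no longer be changed by a writer. */
    final List<Long> toReplicate = new ArrayList<>();
    /** Open ledgers: skipped for now, remembered for a periodic recheck. */
    final List<Long> toRecheck = new ArrayList<>();

    void audit(List<LedgerInfo> ledgers) {
        for (LedgerInfo ledger : ledgers) {
            if (ledger.closed) {
                toReplicate.add(ledger.id);
            } else {
                toRecheck.add(ledger.id);
            }
        }
    }
}
```

The recheck list is the cost of this approach: something has to revisit those ledgers later, which is the open question raised above.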
> 
> Thanks,
> Uma
> ________________________________________
> From: Flavio Junqueira [[email protected]]
> Sent: Wednesday, June 27, 2012 4:21 AM
> To: [email protected]
> Cc: Ivan Kelly; Rakesh R
> Subject: Re: Race condition between  LedgerChecker and Ensemble reformation 
> from client
> 
> Hi Uma, It sounds like the replication worker shouldn't have written:
> 
> 401        10.18.40.155:3181        10.18.40.155:3185        10.18.40.155:3184
> 
> If I'm not missing anything, the replication worker should update an existing 
> entry in the metadata, not create a new entry.
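
A minimal sketch of "update an existing entry, not create a new one", assuming the ensembles are modelled as a sorted map from the first entry id of a fragment to its list of bookies. EnsembleUpdate and replaceBookie are hypothetical names for illustration, not the actual replication worker code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

// Hypothetical sketch, not BookKeeper code.
final class EnsembleUpdate {
    /**
     * Replaces one bookie inside the ensemble that already starts at
     * fragmentStart. It never adds a new key to the map, so the number
     * of fragments in the metadata stays the same.
     */
    static void replaceBookie(SortedMap<Long, List<String>> ensembles,
                              long fragmentStart,
                              String failedBookie,
                              String newBookie) {
        List<String> ensemble = ensembles.get(fragmentStart);
        if (ensemble == null) {
            throw new IllegalArgumentException(
                    "no ensemble starts at entry " + fragmentStart);
        }
        int index = ensemble.indexOf(failedBookie);
        if (index < 0) {
            throw new IllegalArgumentException(
                    failedBookie + " is not in the ensemble at " + fragmentStart);
        }
        // Copy first so a caller holding the old list does not see it change.
        List<String> updated = new ArrayList<>(ensemble);
        updated.set(index, newBookie);
        ensembles.put(fragmentStart, updated);
    }
}
```

With the metadata from this thread, replacing a bookie in the fragment starting at 102 would rewrite the 102 line and leave no 401 line behind.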
> 
> -Flavio
> 
> On Jun 26, 2012, at 6:07 PM, Uma Maheswara Rao G wrote:
> 
>> Hi,
>> 
>> It looks like there is a race between the LedgerChecker and ensemble 
>> reformation from the client.
>> 
>> When one bookie in the ensemble quorum fails, the client tries to reform 
>> the ensemble in handleBookieFailure.
>> 
>> At this point it is reforming the ensemble and resending the write request 
>> to the new bookie (the one added to the new ensemble).
>> 
>> If, at the same time, the ReplicationWorker triggers on the same ledger and 
>> runs the LedgerChecker on it, the LedgerChecker may also treat this last 
>> failed entry as a fragment, because the ensemble change has already been 
>> written to the metadata.
>> 
>> If the ReplicationWorker replicates this last fragment, then 
>> ChangeEnsembleCb#operationComplete will fail with BadVersion, because the 
>> ensemble data has already been updated by the ReplicationWorker.
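
A minimal sketch of the race described above, assuming the metadata store accepts a write only when the writer's expected version matches the stored one. VersionedMetadataStore and its BadVersionException are hypothetical in-memory stand-ins made up for illustration; they are not the real metadata store or client classes.

```java
// Hypothetical sketch, not BookKeeper code: a compare-and-set style store.
final class VersionedMetadataStore {
    /** Thrown when a writer's expected version is stale. */
    static final class BadVersionException extends RuntimeException {
        BadVersionException(String message) {
            super(message);
        }
    }

    private String data;
    private int version = 0;

    VersionedMetadataStore(String initialData) {
        this.data = initialData;
    }

    synchronized int version() {
        return version;
    }

    synchronized String data() {
        return data;
    }

    /** Writes only if nobody else has written since expectedVersion was read. */
    synchronized void write(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            throw new BadVersionException(
                    "expected version " + expectedVersion + " but store is at " + version);
        }
        data = newData;
        version++;
    }
}
```

If the client and the replication worker both read the same version and the worker writes first, the client's later write with the stale version is rejected, which corresponds to the conflict path that ends in closing the ledger.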
>> 
>> 
>> LOG.error("Could not resolve ledger metadata conflict while changing ensemble to: "
>>           + newEnsemble + ", old meta data is \n" + new String(metadata.serialize())
>>           + "\n, new meta data is \n" + new String(newMeta.serialize())
>>           + "\n ,closing ledger");
>> 
>> 2012-06-23 10:51:47,814 - ERROR 
>> [main-EventThread:LedgerHandle$1ChangeEnsembleCb$1$1@714] - Could not 
>> resolve ledger metadata conflict while changing ensemble to: 
>> [/10.18.40.155:3182, /10.18.40.155:3185, /10.18.40.155:3184], old meta data 
>> is
>> BookieMetadataFormatVersion        1
>> 2
>> 3
>> 0
>> 0        10.18.40.155:3181        10.18.40.155:3182        10.18.40.155:3183
>> 102        10.18.40.155:3181        10.18.40.155:3185        10.18.40.155:3183
>> , new meta data is
>> BookieMetadataFormatVersion        1
>> 2
>> 3
>> 0
>> 0        10.18.40.155:3181        10.18.40.155:3182        10.18.40.155:3183
>> 102        10.18.40.155:3181        10.18.40.155:3185        10.18.40.155:3183
>> 401        10.18.40.155:3181        10.18.40.155:3185        10.18.40.155:3184
>> ,closing ledger
>> 
>> 
>> After this, it closes the ledger:
>> asyncCloseInternal(NoopCloseCallback.instance, null, rc);
>> 
>> The final ledger metadata will then look like:
>> 
>> 0        10.18.40.155:3181        10.18.40.155:3182        10.18.40.155:3183
>> 102        10.18.40.155:3181        10.18.40.155:3185        10.18.40.155:3183
>> 401        10.18.40.155:3181        10.18.40.155:3185        10.18.40.155:3184
>> 400   CLOSED
>> 
>> This is because the last successful entry known to the client is 400. Am I 
>> missing something here?
>> 
>> 
>> 
>> 
>> 
>> Regards,
>> 
>> Uma
>> 
>> 
>> 
>> 
